bpf: reject kfunc calls that overflow insn->imm #8
Closed
Conversation
Author
Master branch: edc21dc
Author
Master branch: d2b94f3
Now a kfunc call uses s32 to represent the offset between the address of the kfunc and __bpf_call_base, but it doesn't check whether that offset overflows s32. The overflow is possible when the kfunc is in a module and the offset between the module and the kernel is greater than 2GB. Taking arm64 as an example: before commit b2eed9b ("arm64/kernel: kaslr: reduce module randomization range to 2 GB"), the offset between a module symbol and __bpf_call_base could fall anywhere in a 4GB range due to KASLR and so may overflow s32. So add an extra check to reject these invalid kfunc calls.
Signed-off-by: Hou Tao <[email protected]>
Acked-by: Yonghong Song <[email protected]>
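The rejected condition amounts to the kfunc's distance from __bpf_call_base not surviving a round trip through s32. A minimal standalone sketch of that predicate (illustrative only, not the verifier code; the function name and addresses below are made up):

#include <stdint.h>
#include <stdio.h>

/* Return non-zero iff the distance from __bpf_call_base to the kfunc
 * fits in the 32-bit insn->imm field, i.e. truncating to s32 loses
 * nothing. */
static int kfunc_offset_fits_s32(uint64_t kfunc_addr, uint64_t call_base)
{
	int64_t delta = (int64_t)(kfunc_addr - call_base);

	return (int64_t)(int32_t)delta == delta;
}

int main(void)
{
	uint64_t base = 0xffff000010000000ULL;

	/* kfunc in the kernel image, a few MB away: accepted */
	printf("%d\n", kfunc_offset_fits_s32(base + 0x200000, base));       /* 1 */
	/* kfunc in a module more than 2GB away: rejected */
	printf("%d\n", kfunc_offset_fits_s32(base + 0x100000000ULL, base)); /* 0 */
	return 0;
}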
Force-pushed from 65fd5a1 to 93e6d92
Author
At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=614391 is irrelevant now. Closing PR.
kernel-patches-bot pushed a commit that referenced this pull request on Mar 5, 2022
Ido Schimmel says:
====================
HW counters for soft devices
Petr says:
Offloading switch device drivers may be able to collect statistics of the
traffic taking place in the HW datapath that pertains to a certain soft
netdevice, such as a VLAN. In this patch set, add the necessary
infrastructure to allow exposing these statistics to the offloaded
netdevice in question, and add mlxsw offload.
Across HW platforms, the counter itself very likely constitutes a limited
resource, and the act of counting may have a performance impact. Therefore
this patch set makes the HW statistics collection opt-in and togglable from
userspace on a per-netdevice basis.
Additionally, HW devices may have various limiting conditions under which
they can realize the counter. Therefore it is also possible to query
whether the requested counter is realized by any driver. In TC parlance,
which is to a degree reused in this patch set, two values are recognized:
"request" tracks whether the user enabled collecting HW statistics, and
"used" tracks whether any HW statistics are actually collected.
In the past, this author has expressed the opinion that `a typical user
doing "ip -s l sh", including various scripts, wants to see the full
picture and not worry what's going on where'. While that would be nice,
unfortunately it cannot work:
- Packets that trap from the HW datapath to the SW datapath would be
double counted.
For a given netdevice, some traffic can be purely a SW artifact, and some
may flow through the HW object corresponding to the netdevice. But some
traffic can also get trapped to the SW datapath after bumping the HW
counter. It is not clear how to make sure double-counting does not occur
in the SW datapath in that case, while still making sure that possibly
divergent SW forwarding path gets bumped as appropriate.
So simply adding HW and SW stats may work roughly, most of the time, but
there are scenarios where the result is nonsensical.
- HW devices will have limitations as to what type of traffic they can
count.
In case of mlxsw, which is part of this patch set, there is no reasonable
way to count all traffic going through a certain netdevice, such as a
VLAN netdevice enslaved to a bridge. It is however very simple to count
traffic flowing through an L3 object, such as a VLAN netdevice with an IP
address.
Similarly for physical netdevices, the L3 object at which the counter is
installed is the subport carrying untagged traffic.
These are not "just counters". It is important that the user understands
what is being counted. It would be incorrect to conflate these statistics
with another existing statistics suite.
To that end, this patch set introduces a statistics suite called "L3
stats". This label should make it easy to understand what is being counted,
and to decide whether a given device can or cannot implement this suite for
some type of netdevice. At the same time, the code is written to make
future extensions easy, should a device pop up that can implement a
different flavor of statistics suite (say L2, or an address-family-specific
suite).
For example, using a work-in-progress iproute2[1], to turn on and then list
the counters on a VLAN netdevice:
# ip stats set dev swp1.200 l3_stats on
# ip stats show dev swp1.200 group offload subgroup l3_stats
56: swp1.200: group offload subgroup l3_stats on used on
RX: bytes packets errors dropped missed mcast
0 0 0 0 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0
The patchset progresses as follows:
- Patch #1 is a cleanup.
- In patch #2, remove the assumption that all LINK_OFFLOAD_XSTATS are
dev-backed.
The only attribute defined under the nest is currently
IFLA_OFFLOAD_XSTATS_CPU_HIT. L3_STATS differs from CPU_HIT in that the
driver that supplies the statistics is not the same as the driver that
implements the netdevice. Make the code compatible with this in patch #2.
- In patch #3, add the possibility to filter inside nests.
The filter_mask field of RTM_GETSTATS header determines which
top-level attributes should be included in the netlink response. This
saves processing time by only including the bits that the user cares
about instead of always dumping everything. This is doubly important
for HW-backed statistics that would typically require a trip to the
device to fetch the stats. In this patch, the UAPI is extended to
allow filtering inside IFLA_STATS_LINK_OFFLOAD_XSTATS in particular,
but the scheme is easily extensible to other nests as well (a sketch
of a filter_mask-based request follows the list).
- In patch #4, propagate extack where we need it.
In patch #5, make it possible to propagate errors from drivers to the
user.
- In patch #6, add the in-kernel APIs for keeping track of the new stats
suite, and the notifiers that the core uses to communicate with the
drivers.
- In patch #7, add UAPI for obtaining the new stats suite.
- In patch #8, add a new UAPI message, RTM_SETSTATS, which will carry
the message to toggle the newly-added stats suite.
In patch #9, add the toggle itself.
At this point the core is ready for drivers to add support for the new
stats suite.
- In patches #10, #11 and #12, apply small tweaks to mlxsw code.
- In patch #13, add support for L3 stats, which are realized as RIF
counters.
- Finally in patch #14, a selftest is added to the net/forwarding
directory. Technically this is a HW-specific test, in that without a HW
implementing the counters, it just will not pass. But devices that
support L3 statistics at all are likely to be able to reuse this
selftest, so it seems appropriate to put it in the general forwarding
directory.
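As a concrete illustration of the filter_mask mechanism that patch #3 builds on, below is a minimal sketch of an RTM_GETSTATS request asking only for the IFLA_STATS_LINK_OFFLOAD_XSTATS nest. It assumes the standard <linux/if_link.h> definitions (struct if_stats_msg, IFLA_STATS_FILTER_BIT); socket setup, send/recv, and error handling are omitted:

#include <string.h>
#include <sys/socket.h>
#include <linux/if_link.h>
#include <linux/rtnetlink.h>

struct stats_req {
	struct nlmsghdr nlh;
	struct if_stats_msg ifsm;
};

/* Build an RTM_GETSTATS request whose filter_mask selects only the
 * offload-xstats nest, so the kernel skips assembling everything else. */
static void build_stats_req(struct stats_req *req, int ifindex)
{
	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req->ifsm));
	req->nlh.nlmsg_type = RTM_GETSTATS;
	req->nlh.nlmsg_flags = NLM_F_REQUEST;
	req->ifsm.family = AF_UNSPEC;
	req->ifsm.ifindex = ifindex;
	req->ifsm.filter_mask =
		IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_OFFLOAD_XSTATS);
}

Patch #3 extends the same idea one level down: an attribute in the request selects which children inside the offload-xstats nest get dumped.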
We also have a netdevsim implementation, and a corresponding selftest that
verifies specifically some of the core code. We intend to contribute these
later. Interested parties can take a look at the raw code at [2].
[1] https://github.com/pmachata/iproute2/commits/soft_counters
[2] https://github.com/pmachata/linux_mlxsw/commits/petrm_soft_counters_2
v2:
- Patch #3:
- Do not declare strict_start_type at the new policies, since they are
used with nla_parse_nested() (sans _deprecated).
- Use NLA_POLICY_NESTED to declare what the nest contents should be
- Use NLA_POLICY_MASK instead of BITFIELD32 for the filtering
attribute.
- Patch #6:
- s/monotonous/monotonic/ in commit message
- Use a newly-added struct rtnl_hw_stats64 for stats transfer
- Patch #7:
- Use a newly-added struct rtnl_hw_stats64 for stats transfer
- Patch #8:
- Do not declare strict_start_type at the new policies, since they are
used with nla_parse_nested() (sans _deprecated).
- Patch #13:
- Use a newly-added struct rtnl_hw_stats64 for stats transfer
====================
Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request on Mar 16, 2022
The BPF STX/LDX instructions use an offset relative to the FP to address
stack space. Since BPF_FP sits at the top of the frame, the offset is
usually a negative number. However, the arm64 str/ldr immediate
instructions require the offset to be a positive number. Therefore, this
patch tries to convert the offsets.
The method is to first find the negative offset furthest from the FP,
then add it to the FP to calculate a bottom position, called FPB, and
then adjust the offsets in the other STR/LDX instructions relative to FPB.
FPB is saved in x27, a callee-saved arm64 register that is not otherwise
used.
Before adjusting the offsets, the patch checks every instruction to ensure
that the FP does not change at run time. If the FP may change, no offset
is adjusted.
For example, for the following bpftrace command:
bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'
Without this patch, the JITed code (fragment) is:
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: mov x25, sp
1c: mov x26, #0x0 // #0
20: bti j
24: sub sp, sp, #0x90
28: add x19, x0, #0x0
2c: mov x0, #0x0 // #0
30: mov x10, #0xffffffffffffff78 // #-136
34: str x0, [x25, x10]
38: mov x10, #0xffffffffffffff80 // #-128
3c: str x0, [x25, x10]
40: mov x10, #0xffffffffffffff88 // #-120
44: str x0, [x25, x10]
48: mov x10, #0xffffffffffffff90 // #-112
4c: str x0, [x25, x10]
50: mov x10, #0xffffffffffffff98 // #-104
54: str x0, [x25, x10]
58: mov x10, #0xffffffffffffffa0 // #-96
5c: str x0, [x25, x10]
60: mov x10, #0xffffffffffffffa8 // #-88
64: str x0, [x25, x10]
68: mov x10, #0xffffffffffffffb0 // #-80
6c: str x0, [x25, x10]
70: mov x10, #0xffffffffffffffb8 // #-72
74: str x0, [x25, x10]
78: mov x10, #0xffffffffffffffc0 // #-64
7c: str x0, [x25, x10]
80: mov x10, #0xffffffffffffffc8 // #-56
84: str x0, [x25, x10]
88: mov x10, #0xffffffffffffffd0 // #-48
8c: str x0, [x25, x10]
90: mov x10, #0xffffffffffffffd8 // #-40
94: str x0, [x25, x10]
98: mov x10, #0xffffffffffffffe0 // #-32
9c: str x0, [x25, x10]
a0: mov x10, #0xffffffffffffffe8 // #-24
a4: str x0, [x25, x10]
a8: mov x10, #0xfffffffffffffff0 // #-16
ac: str x0, [x25, x10]
b0: mov x10, #0xfffffffffffffff8 // #-8
b4: str x0, [x25, x10]
b8: mov x10, #0x8 // #8
bc: ldr x2, [x19, x10]
[...]
With this patch, the JITed code (fragment) is:
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: stp x27, x28, [sp, #-16]!
1c: mov x25, sp
20: sub x27, x25, #0x88
24: mov x26, #0x0 // #0
28: bti j
2c: sub sp, sp, #0x90
30: add x19, x0, #0x0
34: mov x0, #0x0 // #0
38: str x0, [x27]
3c: str x0, [x27, #8]
40: str x0, [x27, #16]
44: str x0, [x27, #24]
48: str x0, [x27, #32]
4c: str x0, [x27, #40]
50: str x0, [x27, #48]
54: str x0, [x27, #56]
58: str x0, [x27, #64]
5c: str x0, [x27, #72]
60: str x0, [x27, #80]
64: str x0, [x27, #88]
68: str x0, [x27, #96]
6c: str x0, [x27, #104]
70: str x0, [x27, #112]
74: str x0, [x27, #120]
78: str x0, [x27, #128]
7c: ldr x2, [x19, #8]
[...]
Signed-off-by: Xu Kuohai <[email protected]>
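To make the offset conversion concrete, here is a small standalone sketch (illustrative arithmetic only, not the JIT code itself) of how each negative FP-relative offset is rebased onto FPB once the furthest negative offset is known:

#include <stdio.h>

/* FPB = FP + min_off, where min_off is the most negative FP-relative
 * offset found by scanning the program (-136 in the listing above).
 * An access at FP+off then becomes FPB+(off - min_off), which is >= 0
 * and therefore encodable by the arm64 str/ldr immediate forms. */
static int rebase_to_fpb(int fp_off, int min_off)
{
	return fp_off - min_off;
}

int main(void)
{
	const int min_off = -136;

	for (int off = min_off; off <= -8; off += 8)
		printf("FP%d -> FPB+%d\n", off, rebase_to_fpb(off, min_off));
	return 0;
}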
kernel-patches-bot pushed a commit that referenced this pull request on Mar 18, 2022
We hit a bug with a recovering relocation on mount for one of our file
systems in production. I reproduced this locally by injecting errors
into snapshot delete with balance running at the same time. This
presented as an error while looking up an extent item
WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
RIP: 0010:lookup_inline_extent_backref+0x647/0x680
RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
Call Trace:
<TASK>
insert_inline_extent_backref+0x46/0xd0
__btrfs_inc_extent_ref.isra.0+0x5f/0x200
? btrfs_merge_delayed_refs+0x164/0x190
__btrfs_run_delayed_refs+0x561/0xfa0
? btrfs_search_slot+0x7b4/0xb30
? btrfs_update_root+0x1a9/0x2c0
btrfs_run_delayed_refs+0x73/0x1f0
? btrfs_update_root+0x1a9/0x2c0
btrfs_commit_transaction+0x50/0xa50
? btrfs_update_reloc_root+0x122/0x220
prepare_to_merge+0x29f/0x320
relocate_block_group+0x2b8/0x550
btrfs_relocate_block_group+0x1a6/0x350
btrfs_relocate_chunk+0x27/0xe0
btrfs_balance+0x777/0xe60
balance_kthread+0x35/0x50
? btrfs_balance+0xe60/0xe60
kthread+0x16b/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Normally snapshot deletion and relocation are excluded from running at
the same time by the fs_info->cleaner_mutex. However if we had a pending
balance waiting to get the ->cleaner_mutex, and a snapshot deletion was
running, and then the box crashed, we would come up in a state where we
have a half deleted snapshot. Again, in the normal case the snapshot
deletion needs to complete before relocation can start, but in this case
relocation could very well start before the snapshot deletion completes,
as we simply add the root to the dead roots list and wait for the next
time the cleaner runs to clean up the snapshot.
Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that
had a pending drop_progress key. If they do then we know we were in the
middle of the drop operation and set a flag on the fs_info. Then balance
can wait until this flag is cleared to start up again. If there are
DEAD_ROOT's that don't have a drop_progress set then we're safe to start
balance right away as we'll be properly protected by the cleaner_mutex.
CC: [email protected] # 5.10+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
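The fix above is described only in prose; here is a minimal sketch of such a "mark and wait" scheme using the kernel's generic bit-wait API. The flag name and fs_info structure are hypothetical stand-ins, not the actual btrfs patch:

#include <linux/bitops.h>
#include <linux/sched.h>
#include <linux/wait_bit.h>

/* Hypothetical flag: set at mount time when a dead root still carries
 * a drop_progress key, i.e. a snapshot drop was interrupted mid-way. */
#define MY_FS_UNFINISHED_DROPS	0

struct my_fs_info {
	unsigned long flags;
};

/* Cleaner side: once the half-deleted snapshot is fully dropped,
 * clear the flag and wake anyone blocked on it. */
static void my_drops_finished(struct my_fs_info *fs_info)
{
	clear_bit(MY_FS_UNFINISHED_DROPS, &fs_info->flags);
	wake_up_bit(&fs_info->flags, MY_FS_UNFINISHED_DROPS);
}

/* Balance side: do not start relocating until all interrupted drops
 * have been cleaned up by the cleaner thread. */
static int my_wait_for_drops(struct my_fs_info *fs_info)
{
	return wait_on_bit(&fs_info->flags, MY_FS_UNFINISHED_DROPS,
			   TASK_INTERRUPTIBLE);
}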
kernel-patches-bot
pushed a commit
that referenced
this pull request
Mar 19, 2022
The BPF STX/LDX instructions use an offset relative to the FP to address
stack space. Since BPF_FP sits at the top of the frame, the offset
is usually a negative number. However, the arm64 str/ldr immediate form
requires the offset to be a non-negative number. Therefore, this patch
converts the offsets.
The method is to first find the negative offset furthest from the FP,
add it to the FP to obtain a bottom position, called FPB, and then
adjust the offsets in the other STR/LDX instructions to be relative to FPB.
FPB is saved using the callee-saved register x27 of arm64 which is not
used yet.
Before adjusting the offsets, the patch checks every instruction to ensure
that the FP does not change at run time. If the FP may change, no offsets
are adjusted.
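A minimal sketch of the rebasing arithmetic (function and variable names here are illustrative, not the JIT's own):

/* fpb_offset is the furthest negative FP offset found by scanning the
 * program, e.g. -136 in the example below, so FPB = FP + fpb_offset.
 */
static int fp_to_fpb_offset(int fp_offset, int fpb_offset)
{
	/* An FP-relative offset in [fpb_offset, 0] becomes a
	 * non-negative FPB-relative offset, which fits the str/ldr
	 * immediate form.
	 */
	return fp_offset - fpb_offset;
}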
For example, for the following bpftrace command:
bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'
Without this patch, the jited code (fragment):
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: mov x25, sp
1c: mov x26, #0x0 // #0
20: bti j
24: sub sp, sp, #0x90
28: add x19, x0, #0x0
2c: mov x0, #0x0 // #0
30: mov x10, #0xffffffffffffff78 // #-136
34: str x0, [x25, x10]
38: mov x10, #0xffffffffffffff80 // #-128
3c: str x0, [x25, x10]
40: mov x10, #0xffffffffffffff88 // #-120
44: str x0, [x25, x10]
48: mov x10, #0xffffffffffffff90 // #-112
4c: str x0, [x25, x10]
50: mov x10, #0xffffffffffffff98 // #-104
54: str x0, [x25, x10]
58: mov x10, #0xffffffffffffffa0 // #-96
5c: str x0, [x25, x10]
60: mov x10, #0xffffffffffffffa8 // #-88
64: str x0, [x25, x10]
68: mov x10, #0xffffffffffffffb0 // #-80
6c: str x0, [x25, x10]
70: mov x10, #0xffffffffffffffb8 // #-72
74: str x0, [x25, x10]
78: mov x10, #0xffffffffffffffc0 // #-64
7c: str x0, [x25, x10]
80: mov x10, #0xffffffffffffffc8 // #-56
84: str x0, [x25, x10]
88: mov x10, #0xffffffffffffffd0 // #-48
8c: str x0, [x25, x10]
90: mov x10, #0xffffffffffffffd8 // #-40
94: str x0, [x25, x10]
98: mov x10, #0xffffffffffffffe0 // #-32
9c: str x0, [x25, x10]
a0: mov x10, #0xffffffffffffffe8 // #-24
a4: str x0, [x25, x10]
a8: mov x10, #0xfffffffffffffff0 // #-16
ac: str x0, [x25, x10]
b0: mov x10, #0xfffffffffffffff8 // #-8
b4: str x0, [x25, x10]
b8: mov x10, #0x8 // #8
bc: ldr x2, [x19, x10]
[...]
With this patch, the jited code (fragment):
0: bti c
4: stp x29, x30, [sp, #-16]!
8: mov x29, sp
c: stp x19, x20, [sp, #-16]!
10: stp x21, x22, [sp, #-16]!
14: stp x25, x26, [sp, #-16]!
18: stp x27, x28, [sp, #-16]!
1c: mov x25, sp
20: sub x27, x25, #0x88
24: mov x26, #0x0 // #0
28: bti j
2c: sub sp, sp, #0x90
30: add x19, x0, #0x0
34: mov x0, #0x0 // #0
38: str x0, [x27]
3c: str x0, [x27, #8]
40: str x0, [x27, #16]
44: str x0, [x27, #24]
48: str x0, [x27, #32]
4c: str x0, [x27, #40]
50: str x0, [x27, #48]
54: str x0, [x27, #56]
58: str x0, [x27, #64]
5c: str x0, [x27, #72]
60: str x0, [x27, #80]
64: str x0, [x27, #88]
68: str x0, [x27, #96]
6c: str x0, [x27, #104]
70: str x0, [x27, #112]
74: str x0, [x27, #120]
78: str x0, [x27, #128]
7c: ldr x2, [x19, #8]
[...]
Signed-off-by: Xu Kuohai <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Apr 1, 2025
perf test 11 hwmon fails on s390 with this error
# ./perf test -Fv 11
--- start ---
---- end ----
11.1: Basic parsing test : Ok
--- start ---
Testing 'temp_test_hwmon_event1'
Using CPUID IBM,3931,704,A01,3.7,002f
temp_test_hwmon_event1 -> hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/
FAILED tests/hwmon_pmu.c:189 Unexpected config for
'temp_test_hwmon_event1', 292470092988416 != 655361
---- end ----
11.2: Parsing without PMU name : FAILED!
--- start ---
Testing 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/'
FAILED tests/hwmon_pmu.c:189 Unexpected config for
'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/',
292470092988416 != 655361
---- end ----
11.3: Parsing with PMU name : FAILED!
#
The root cause is in member test_event::config, which is initialized
to 0xA0001 or 655361. During event parsing, a long chain of event
parsing functions is called, ending up with this gdb call stack:
#0 hwmon_pmu__config_term (hwm=0x168dfd0, attr=0x3ffffff5ee8,
term=0x168db60, err=0x3ffffff81c8) at util/hwmon_pmu.c:623
#1 hwmon_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8,
terms=0x3ffffff5ea8, err=0x3ffffff81c8) at util/hwmon_pmu.c:662
#2 0x00000000012f870c in perf_pmu__config_terms (pmu=0x168dfd0,
attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, zero=false,
apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1519
#3 0x00000000012f88a4 in perf_pmu__config (pmu=0x168dfd0, attr=0x3ffffff5ee8,
head_terms=0x3ffffff5ea8, apply_hardcoded=false, err=0x3ffffff81c8)
at util/pmu.c:1545
#4 0x00000000012680c4 in parse_events_add_pmu (parse_state=0x3ffffff7fb8,
list=0x168dc00, pmu=0x168dfd0, const_parsed_terms=0x3ffffff6090,
auto_merge_stats=true, alternate_hw_config=10)
at util/parse-events.c:1508
#5 0x00000000012684c6 in parse_events_multi_pmu_add (parse_state=0x3ffffff7fb8,
event_name=0x168ec10 "temp_test_hwmon_event1", hw_config=10,
const_parsed_terms=0x0, listp=0x3ffffff6230, loc_=0x3ffffff70e0)
at util/parse-events.c:1592
#6 0x00000000012f0e4e in parse_events_parse (_parse_state=0x3ffffff7fb8,
scanner=0x16878c0) at util/parse-events.y:293
#7 0x00000000012695a0 in parse_events__scanner (str=0x3ffffff81d8
"temp_test_hwmon_event1", input=0x0, parse_state=0x3ffffff7fb8)
at util/parse-events.c:1867
#8 0x000000000126a1e8 in __parse_events (evlist=0x168b580,
str=0x3ffffff81d8 "temp_test_hwmon_event1", pmu_filter=0x0,
err=0x3ffffff81c8, fake_pmu=false, warn_if_reordered=true,
fake_tp=false) at util/parse-events.c:2136
#9 0x00000000011e36aa in parse_events (evlist=0x168b580,
str=0x3ffffff81d8 "temp_test_hwmon_event1", err=0x3ffffff81c8)
at /root/linux/tools/perf/util/parse-events.h:41
#10 0x00000000011e3e64 in do_test (i=0, with_pmu=false, with_alias=false)
at tests/hwmon_pmu.c:164
#11 0x00000000011e422c in test__hwmon_pmu (with_pmu=false)
at tests/hwmon_pmu.c:219
#12 0x00000000011e431c in test__hwmon_pmu_without_pmu (test=0x1610368
<suite.hwmon_pmu>, subtest=1) at tests/hwmon_pmu.c:23
where the attr::config is set to value 292470092988416 or 0x10a0000000000
in line 625 of file ./util/hwmon_pmu.c:
attr->config = key.type_and_num;
However, member key::type_and_num is defined as a union with bit fields:
union hwmon_pmu_event_key {
long type_and_num;
struct {
int num :16;
enum hwmon_type type :8;
};
};
s390 is a big-endian architecture and Intel a little-endian one.
The events for the hwmon dummy pmu have num = 1 or num = 2 and
type is set to HWMON_TYPE_TEMP (which is 10).
On s390 this assigns member key::type_and_num the value
0x10a0000000000 (which is 292470092988416), as shown in the
trace output above.
Fix this and export the structure/union hwmon_pmu_event_key
so the test shares the same implementation as the event parsing
functions for union and bit fields. This should avoid
endianness issues on all platforms.
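A stand-alone user-space sketch of the pitfall (assuming GCC's bit-field layout; the values in the comment match the report above):

#include <stdio.h>

enum hwmon_type { HWMON_TYPE_TEMP = 10 };

union hwmon_pmu_event_key {
	long type_and_num;
	struct {
		int num : 16;
		enum hwmon_type type : 8;
	};
};

int main(void)
{
	union hwmon_pmu_event_key key = { .type_and_num = 0 };

	key.num = 1;
	key.type = HWMON_TYPE_TEMP;
	/* Little endian: 0xa0001 (655361); big endian (s390):
	 * 0x10a0000000000 (292470092988416), because the bit fields
	 * land in the most significant bytes of the long.
	 */
	printf("0x%lx\n", key.type_and_num);
	return 0;
}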
Output after:
# ./perf test -F 11
11.1: Basic parsing test : Ok
11.2: Parsing without PMU name : Ok
11.3: Parsing with PMU name : Ok
#
Fixes: 531ee0f ("perf test: Add hwmon "PMU" test")
Signed-off-by: Thomas Richter <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Apr 1, 2025
Ian told me that there are many memory leaks in the hierarchy mode. I
can easily reproduce it with the following command.
$ make DEBUG=1 EXTRA_CFLAGS=-fsanitize=leak
$ perf record --latency -g -- ./perf test -w thloop
$ perf report -H --stdio
...
Indirect leak of 168 byte(s) in 21 object(s) allocated from:
#0 0x7f3414c16c65 in malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:75
#1 0x55ed3602346e in map__get util/map.h:189
#2 0x55ed36024cc4 in hist_entry__init util/hist.c:476
#3 0x55ed36025208 in hist_entry__new util/hist.c:588
#4 0x55ed36027c05 in hierarchy_insert_entry util/hist.c:1587
#5 0x55ed36027e2e in hists__hierarchy_insert_entry util/hist.c:1638
#6 0x55ed36027fa4 in hists__collapse_insert_entry util/hist.c:1685
#7 0x55ed360283e8 in hists__collapse_resort util/hist.c:1776
#8 0x55ed35de0323 in report__collapse_hists /home/namhyung/project/linux/tools/perf/builtin-report.c:735
#9 0x55ed35de15b4 in __cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1119
#10 0x55ed35de43dc in cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1867
#11 0x55ed35e66767 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:351
#12 0x55ed35e66a0e in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:404
#13 0x55ed35e66b67 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:448
#14 0x55ed35e66eb0 in main /home/namhyung/project/linux/tools/perf/perf.c:556
#15 0x7f340ac33d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
...
$ perf report -H --stdio 2>&1 | grep -c '^Indirect leak'
93
I found that hist_entry__delete() failed to release child entries in the
hierarchy tree (hroot_{in,out}). It needs to iterate the child entries
and call hist_entry__delete() recursively.
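The shape of the fix, sketched with approximate names from perf's hist code (hroot_in/hroot_out are rb_root_cached trees of child entries; the link field used here may not match the real code exactly):

static void free_hierarchy_children(struct rb_root_cached *root)
{
	struct rb_node *node;
	struct hist_entry *child;

	while ((node = rb_first_cached(root)) != NULL) {
		child = rb_entry(node, struct hist_entry, rb_node_in);
		rb_erase_cached(node, root);
		hist_entry__delete(child);	/* recurses into child's hroots */
	}
}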
After this change:
$ perf report -H --stdio 2>&1 | grep -c '^Indirect leak'
0
Reported-by: Ian Rogers <[email protected]>
Tested-by: Thomas Falcon <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Apr 1, 2025
The env.pmu_mapping can be leaked when perf reads data from a pipe on AMD.
For pipe data, it reads the header data, including pmu_mapping, from
the PERF_RECORD_HEADER_FEATURE record at runtime. But it's already set in:
perf_session__new()
__perf_session__new()
evlist__init_trace_event_sample_raw()
evlist__has_amd_ibs()
perf_env__nr_pmu_mappings()
Then it'll overwrite that when it processes the HEADER_FEATURE record.
Here's a report from address sanitizer.
Direct leak of 2689 byte(s) in 1 object(s) allocated from:
#0 0x7fed8f814596 in realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
#1 0x5595a7d416b1 in strbuf_grow util/strbuf.c:64
#2 0x5595a7d414ef in strbuf_init util/strbuf.c:25
#3 0x5595a7d0f4b7 in perf_env__read_pmu_mappings util/env.c:362
#4 0x5595a7d12ab7 in perf_env__nr_pmu_mappings util/env.c:517
#5 0x5595a7d89d2f in evlist__has_amd_ibs util/amd-sample-raw.c:315
#6 0x5595a7d87fb2 in evlist__init_trace_event_sample_raw util/sample-raw.c:23
#7 0x5595a7d7f893 in __perf_session__new util/session.c:179
#8 0x5595a7b79572 in perf_session__new util/session.h:115
#9 0x5595a7b7e9dc in cmd_report builtin-report.c:1603
#10 0x5595a7c019eb in run_builtin perf.c:351
#11 0x5595a7c01c92 in handle_internal_command perf.c:404
#12 0x5595a7c01deb in run_argv perf.c:448
#13 0x5595a7c02134 in main perf.c:556
#14 0x7fed85833d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
Let's free the existing pmu_mapping data if any.
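A sketch of the idea (zfree() is perf's free-and-NULL helper; the helper name here is made up):

static void perf_env__release_pmu_mappings(struct perf_env *env)
{
	/* Drop the mapping built eagerly by perf_env__nr_pmu_mappings()
	 * before the HEADER_FEATURE handler installs the on-record one.
	 */
	zfree(&env->pmu_mappings);
	env->nr_pmu_mappings = 0;
}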
Cc: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Apr 3, 2025
When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
which causes the flush_bio to be throttled by wbt_wait().
An example from v5.4; a similar problem also exists upstream:
crash> bt 2091206
PID: 2091206 TASK: ffff2050df92a300 CPU: 109 COMMAND: "kworker/u260:0"
#0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
#1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
#2 [ffff800084a2f880] schedule at ffff800040bfa4b4
#3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
#4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
#5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
#6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
#7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
#8 [ffff800084a2fa60] generic_make_request at ffff800040570138
#9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
#10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
#11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
#12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
#13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
#14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
#15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
#16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
#17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
#18 [ffff800084a2fe70] kthread at ffff800040118de4
After commit 2def284 ("xfs: don't allow log IO to be throttled"),
the metadata submitted by xlog_write_iclog() should not be throttled.
But due to the existence of the dm layer, throttling flush_bio indirectly
causes the metadata bio to be throttled.
Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
wbt_should_throttle() return false to avoid wbt_wait().
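A sketch of the flag selection (the helper name and the skip_wbt condition are illustrative; wbt leaves REQ_SYNC|REQ_IDLE writes unthrottled):

static inline blk_opf_t flush_bio_opf(bool skip_wbt)
{
	blk_opf_t opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;

	/* REQ_SYNC alone is throttled by wbt; REQ_SYNC | REQ_IDLE
	 * makes wbt_should_throttle() return false.
	 */
	if (skip_wbt)
		opf |= REQ_IDLE;
	return opf;
}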
Signed-off-by: Jinliang Zheng <[email protected]>
Reviewed-by: Tianxiang Peng <[email protected]>
Reviewed-by: Hao Peng <[email protected]>
Signed-off-by: Mikulas Patocka <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Apr 5, 2025
v2:
- Created a single error handling unlock and exit in veth_pool_store
- Greatly expanded commit message with previous explanatory-only text
Summary: Use rtnl_mutex to synchronize veth_pool_store with itself,
ibmveth_close and ibmveth_open, preventing multiple calls in a row to
napi_disable.
Background: Two (or more) threads could call veth_pool_store through
writing to /sys/devices/vio/30000002/pool*/*. You can do this easily
with a little shell script. This causes a hang.
I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new
kernel. I ran this test again and saw:
Setting pool0/active to 0
Setting pool1/active to 1
[ 73.911067][ T4365] ibmveth 30000002 eth0: close starting
Setting pool1/active to 1
Setting pool1/active to 0
[ 73.911367][ T4366] ibmveth 30000002 eth0: close starting
[ 73.916056][ T4365] ibmveth 30000002 eth0: close complete
[ 73.916064][ T4365] ibmveth 30000002 eth0: open starting
[ 110.808564][ T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
[ 230.808495][ T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
[ 243.683786][ T123] INFO: task stress.sh:4365 blocked for more than 122 seconds.
[ 243.683827][ T123] Not tainted 6.14.0-01103-g2df0c02dab82-dirty #8
[ 243.683833][ T123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.683838][ T123] task:stress.sh state:D stack:28096 pid:4365 tgid:4365 ppid:4364 task_flags:0x400040 flags:0x00042000
[ 243.683852][ T123] Call Trace:
[ 243.683857][ T123] [c00000000c38f690] [0000000000000001] 0x1 (unreliable)
[ 243.683868][ T123] [c00000000c38f840] [c00000000001f908] __switch_to+0x318/0x4e0
[ 243.683878][ T123] [c00000000c38f8a0] [c000000001549a70] __schedule+0x500/0x12a0
[ 243.683888][ T123] [c00000000c38f9a0] [c00000000154a878] schedule+0x68/0x210
[ 243.683896][ T123] [c00000000c38f9d0] [c00000000154ac80] schedule_preempt_disabled+0x30/0x50
[ 243.683904][ T123] [c00000000c38fa00] [c00000000154dbb0] __mutex_lock+0x730/0x10f0
[ 243.683913][ T123] [c00000000c38fb10] [c000000001154d40] napi_enable+0x30/0x60
[ 243.683921][ T123] [c00000000c38fb40] [c000000000f4ae94] ibmveth_open+0x68/0x5dc
[ 243.683928][ T123] [c00000000c38fbe0] [c000000000f4aa20] veth_pool_store+0x220/0x270
[ 243.683936][ T123] [c00000000c38fc70] [c000000000826278] sysfs_kf_write+0x68/0xb0
[ 243.683944][ T123] [c00000000c38fcb0] [c0000000008240b8] kernfs_fop_write_iter+0x198/0x2d0
[ 243.683951][ T123] [c00000000c38fd00] [c00000000071b9ac] vfs_write+0x34c/0x650
[ 243.683958][ T123] [c00000000c38fdc0] [c00000000071bea8] ksys_write+0x88/0x150
[ 243.683966][ T123] [c00000000c38fe10] [c0000000000317f4] system_call_exception+0x124/0x340
[ 243.683973][ T123] [c00000000c38fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
...
[ 243.684087][ T123] Showing all locks held in the system:
[ 243.684095][ T123] 1 lock held by khungtaskd/123:
[ 243.684099][ T123] #0: c00000000278e370 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x50/0x248
[ 243.684114][ T123] 4 locks held by stress.sh/4365:
[ 243.684119][ T123] #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
[ 243.684132][ T123] #1: c000000041aea888 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
[ 243.684143][ T123] #2: c0000000366fb9a8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
[ 243.684155][ T123] #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_enable+0x30/0x60
[ 243.684166][ T123] 5 locks held by stress.sh/4366:
[ 243.684170][ T123] #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
[ 243.684183][ T123] #1: c00000000aee2288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
[ 243.684194][ T123] #2: c0000000366f4ba8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
[ 243.684205][ T123] #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_disable+0x30/0x60
[ 243.684216][ T123] #4: c0000003ff9bbf18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x138/0x12a0
From the ibmveth debug, two threads are calling veth_pool_store, which
calls ibmveth_close and ibmveth_open. Here's the sequence:
T4365 T4366
----------------- ----------------- ---------
veth_pool_store veth_pool_store
ibmveth_close
ibmveth_close
napi_disable
napi_disable
ibmveth_open
napi_enable <- HANG
ibmveth_close calls napi_disable at the top and ibmveth_open calls
napi_enable at the top.
https://docs.kernel.org/networking/napi.html says
The control APIs are not idempotent. Control API calls are safe
against concurrent use of datapath APIs but an incorrect sequence of
control API calls may result in crashes, deadlocks, or race
conditions. For example, calling napi_disable() multiple times in a
row will deadlock.
In the normal open and close paths, rtnl_mutex is acquired to prevent
other callers. This is missing from veth_pool_store. Using rtnl_mutex in
veth_pool_store fixes these hangs.
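The locking shape of the fix, sketched (the store signature follows ibmveth's pool kobject attributes; the body is elided and illustrative):

static ssize_t veth_pool_store(struct kobject *kobj, struct attribute *attr,
			       const char *buf, size_t count)
{
	ssize_t ret = count;

	rtnl_lock();	/* serialize with ndo_open/ndo_stop and other writers */
	/*
	 * ... parse buf; call ibmveth_close()/ibmveth_open() as needed,
	 * with all error paths funneled to the single unlock below ...
	 */
	rtnl_unlock();
	return ret;
}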
Signed-off-by: Dave Marquardt <[email protected]>
Fixes: 860f242 ("[PATCH] ibmveth change buffer pools dynamically")
Reviewed-by: Nick Child <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Apr 10, 2025
As reported by CVE-2025-29481 [1], it is possible to corrupt a BPF ELF
file such that arbitrary BPF instructions are loaded by libbpf. This can
be done by setting a symbol (BPF program) section offset to a large
(unsigned) number such that <section start + symbol offset> overflows
and points before the section data in the memory.
Consider the situation below where:
- prog_start = sec_start + symbol_offset <-- size_t overflow here
- prog_end = prog_start + prog_size
prog_start sec_start prog_end sec_end
| | | |
v v v v
.....................|################################|............
The CVE report in [1] also provides a corrupted BPF ELF which can be
used as a reproducer:
$ readelf -S crash
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
...
[ 2] uretprobe.mu[...] PROGBITS 0000000000000000 00000040
0000000000000068 0000000000000000 AX 0 0 8
$ readelf -s crash
Symbol table '.symtab' contains 8 entries:
Num: Value Size Type Bind Vis Ndx Name
...
6: ffffffffffffffb8 104 FUNC GLOBAL DEFAULT 2 handle_tp
Here, the handle_tp prog has section offset ffffffffffffffb8, i.e. will
point before the actual memory where section 2 is allocated.
This is also reported by AddressSanitizer:
=================================================================
==1232==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7c7302fe0000 at pc 0x7fc3046e4b77 bp 0x7ffe64677cd0 sp 0x7ffe64677490
READ of size 104 at 0x7c7302fe0000 thread T0
#0 0x7fc3046e4b76 in memcpy (/lib64/libasan.so.8+0xe4b76)
#1 0x00000040df3e in bpf_object__init_prog /src/libbpf/src/libbpf.c:856
#2 0x00000040df3e in bpf_object__add_programs /src/libbpf/src/libbpf.c:928
#3 0x00000040df3e in bpf_object__elf_collect /src/libbpf/src/libbpf.c:3930
#4 0x00000040df3e in bpf_object_open /src/libbpf/src/libbpf.c:8067
#5 0x00000040f176 in bpf_object__open_file /src/libbpf/src/libbpf.c:8090
#6 0x000000400c16 in main /poc/poc.c:8
#7 0x7fc3043d25b4 in __libc_start_call_main (/lib64/libc.so.6+0x35b4)
#8 0x7fc3043d2667 in __libc_start_main@@GLIBC_2.34 (/lib64/libc.so.6+0x3667)
#9 0x000000400b34 in _start (/poc/poc+0x400b34)
0x7c7302fe0000 is located 64 bytes before 104-byte region [0x7c7302fe0040,0x7c7302fe00a8)
allocated by thread T0 here:
#0 0x7fc3046e716b in malloc (/lib64/libasan.so.8+0xe716b)
#1 0x7fc3045ee600 in __libelf_set_rawdata_wrlock (/lib64/libelf.so.1+0xb600)
#2 0x7fc3045ef018 in __elf_getdata_rdlock (/lib64/libelf.so.1+0xc018)
#3 0x00000040642f in elf_sec_data /src/libbpf/src/libbpf.c:3740
The problem here is that currently, libbpf only checks that the program
end is within the section bounds. There used to be a check
`while (sec_off < sec_sz)` in bpf_object__add_programs, however, it was
removed by commit 6245947 ("libbpf: Allow gaps in BPF program
sections to support overriden weak functions").
Put the above condition back to bpf_object__init_prog to make sure that
the program start is also within the bounds of the section to avoid the
potential buffer overflow.
[1] https://github.com/lmarch2/poc/blob/main/libbpf/libbpf.md
Reported-by: lmarch2 <[email protected]>
Cc: [email protected]
Fixes: 6245947 ("libbpf: Allow gaps in BPF program sections to support overriden weak functions")
Link: https://github.com/lmarch2/poc/blob/main/libbpf/libbpf.md
Link: https://www.cve.org/CVERecord?id=CVE-2025-29481
Signed-off-by: Viktor Malik <[email protected]>
Reviewed-by: Shung-Hsi Yu <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Apr 15, 2025
As shown in [1], it is possible to corrupt a BPF ELF file such that
arbitrary BPF instructions are loaded by libbpf. This can be done by
setting a symbol (BPF program) section offset to a large (unsigned)
number such that <section start + symbol offset> overflows and points
before the section data in the memory.
Consider the situation below where:
- prog_start = sec_start + symbol_offset <-- size_t overflow here
- prog_end = prog_start + prog_size
prog_start sec_start prog_end sec_end
| | | |
v v v v
.....................|################################|............
The report in [1] also provides a corrupted BPF ELF which can be used as
a reproducer:
$ readelf -S crash
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
...
[ 2] uretprobe.mu[...] PROGBITS 0000000000000000 00000040
0000000000000068 0000000000000000 AX 0 0 8
$ readelf -s crash
Symbol table '.symtab' contains 8 entries:
Num: Value Size Type Bind Vis Ndx Name
...
6: ffffffffffffffb8 104 FUNC GLOBAL DEFAULT 2 handle_tp
Here, the handle_tp prog has section offset ffffffffffffffb8, i.e. will
point before the actual memory where section 2 is allocated.
This is also reported by AddressSanitizer:
=================================================================
==1232==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7c7302fe0000 at pc 0x7fc3046e4b77 bp 0x7ffe64677cd0 sp 0x7ffe64677490
READ of size 104 at 0x7c7302fe0000 thread T0
#0 0x7fc3046e4b76 in memcpy (/lib64/libasan.so.8+0xe4b76)
#1 0x00000040df3e in bpf_object__init_prog /src/libbpf/src/libbpf.c:856
#2 0x00000040df3e in bpf_object__add_programs /src/libbpf/src/libbpf.c:928
#3 0x00000040df3e in bpf_object__elf_collect /src/libbpf/src/libbpf.c:3930
#4 0x00000040df3e in bpf_object_open /src/libbpf/src/libbpf.c:8067
#5 0x00000040f176 in bpf_object__open_file /src/libbpf/src/libbpf.c:8090
#6 0x000000400c16 in main /poc/poc.c:8
#7 0x7fc3043d25b4 in __libc_start_call_main (/lib64/libc.so.6+0x35b4)
#8 0x7fc3043d2667 in __libc_start_main@@GLIBC_2.34 (/lib64/libc.so.6+0x3667)
#9 0x000000400b34 in _start (/poc/poc+0x400b34)
0x7c7302fe0000 is located 64 bytes before 104-byte region [0x7c7302fe0040,0x7c7302fe00a8)
allocated by thread T0 here:
#0 0x7fc3046e716b in malloc (/lib64/libasan.so.8+0xe716b)
#1 0x7fc3045ee600 in __libelf_set_rawdata_wrlock (/lib64/libelf.so.1+0xb600)
#2 0x7fc3045ef018 in __elf_getdata_rdlock (/lib64/libelf.so.1+0xc018)
#3 0x00000040642f in elf_sec_data /src/libbpf/src/libbpf.c:3740
The problem here is that currently, libbpf only checks that the program
end is within the section bounds. There used to be a check
`while (sec_off < sec_sz)` in bpf_object__add_programs, however, it was
removed by commit 6245947 ("libbpf: Allow gaps in BPF program
sections to support overriden weak functions").
Add a check for detecting the overflow of `sec_off + prog_sz` to
bpf_object__init_prog to fix this issue.
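The essence of the added check, as a small overflow-free predicate (names illustrative; the in-tree check is written directly in bpf_object__init_prog()):

static bool prog_fits_in_sec(size_t sec_off, size_t prog_sz, size_t sec_sz)
{
	/* Equivalent to sec_off + prog_sz <= sec_sz, but cannot wrap:
	 * the subtraction is only evaluated when sec_off <= sec_sz.
	 */
	return sec_off <= sec_sz && prog_sz <= sec_sz - sec_off;
}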
[1] https://github.com/lmarch2/poc/blob/main/libbpf/libbpf.md
Fixes: 6245947 ("libbpf: Allow gaps in BPF program sections to support overriden weak functions")
Reported-by: lmarch2 <[email protected]>
Signed-off-by: Viktor Malik <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Reviewed-by: Shung-Hsi Yu <[email protected]>
Link: https://github.com/lmarch2/poc/blob/main/libbpf/libbpf.md
Link: https://lore.kernel.org/bpf/[email protected]
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
May 29, 2025
Biju Das <[email protected]> says: The CAN-FD module on RZ/G3E is very similar to the one on both R-Car V4H and RZ/G2L, but differs in some hardware parameters: * No external clock, but instead has ram clock. * Support up to 6 channels. * 20 interrupts. v8->v9: * Collected tags. * Added missing header bitfield.h. * Fixed logical error ch->BIT(ch) in rcar_canfd_global_error(). * Removed unneeded double space in rcar_canfd_setrnc(). * Updated commit description in patch#15. v7->v8: * Collected tags. * Updated commit description for patch#{5,9,15,16,17}. * Replaced the macro RCANFD_GERFL_EEF0_7->RCANFD_GERFL_EEF. * Dropped the redundant macro RCANFD_GERFL_EEF(ch). * Added patch for dropping the mask operation in RCANFD_GAFLCFG_SETRNC macro. * Converted RCANFD_GAFLCFG_SETRNC->rcar_canfd_setrnc(). * Updated RCANFD_GAFLCFG macro by replacing the parameter ch->w, where w is the GAFLCFG index used in the hardware manual. * Renamed the parameter x->page_num in RCANFD_GAFLECTR_AFLPN macro to make it clear. * Renamed the parameter x->cftml in RCANFD_CFCC_CFTML macro to make it clear. * Updated {rzg2l,car_gen3_hw_info} with ch_interface_mode = 0. * Updated {rzg2l,rcar_gen3}_hw_info with shared_can_regs = 0. * Started using struct rcanfd_regs instead of LUT for reg offsets. * Started using struct rcar_canfd_shift_data instead of LUT for shift data. * Renamed only_internal_clks->external_clk to avoid negation. * Updated rcar_canfd_hw_info tables with external_clk entries. * Replaced 10->sizeof(name) in scnprintf(). v6->v7: * Collected tags * Replaced 'aswell'->'as well' in patch#11 commit description. v5->v6: * Replaced RCANFD_RNC_PER_REG macro with rnc_stride variable. * Updated commit description for patch#7 and #8 * Dropped mask_table: AFLPN_MASK is replaced by max_aflpn variable. CFTML_MASK is replaced by max_cftml variable. BITTIMING MASK's are replaced by {nom,data}_bittiming variables. * Collected tag from Geert. v4->v5: * Collected tag from Geert. * The rules for R-Car Gen3/4 could be kept together, reducing the number of lines. Similar change for rzg2l-canfd aswell. * Keeping interrupts and resets together allows to keep a clear separation between RZ/G2L and RZ/G3E, at the expense of only a single line. * Retained the tags for binding patches as it is trivial changes. * Dropped the unused macro RCANFD_GAFLCFG_GETRNC. * Updated macro RCANFD_GERFL_ERR by using gpriv->channels_mask and dropped unused macro RCANFD_GERFL_EEF0_7. * Replaced RNC mask in RCANFD_GAFLCFG_SETRNC macro by using info->num_supported_rules variable. * Updated the macro RCANFD_GAFLCFG by using info->rnc_field_width variable. * Updated shift value in RCANFD_GAFLCFG_SETRNC macro by using a formula (32 - (n % rnc_per_reg + 1) * field_width). * Replaced the variable name shared_can_reg->shared_can_regs. * Improved commit description for patch{#11,#12}by replacing has->have. * Dropped RCANFD_EEF_MASK and RCANFD_RNC_MASK as it is taken care by gpriv->channels_mask and info->num_supported_rules. * Dropped RCANFD_FIRST_RNC_SH and RCANFD_SECOND_RNC_SH by using a formula (32 - (n % rnc_per_reg + 1) * rnc_field_width. * Improved commit description by "All SoCs supports extenal clock"-> "All existing SoCs support an external clock". * Updated error description in probe as "cannot get enabled ram clock" * Updated r9a09g047_hw_info table. v3->v4: * Added Rb tag from Rob for patch#2. * Added prefix RCANFD_* to enum rcar_canfd_reg_offset_id. * Added prefix RCANFD_* to enum rcar_canfd_mask_id. 
* Added prefix RCANFD_* to enum rcar_canfd_shift_id. v2->v3: * Collected tags. * Dropped reg_gen4() and is_gen4() by adding mask_table, shift_table, regs, ch_interface_mode and shared_can_reg variables to struct rcar_canfd_hw_info. v1->v2: * Split the series with fixes patch separately. * Added patch for Simplify rcar_canfd_probe() using of_get_available_child_by_name() as dependency patch hit on can-next. * Added Rb tag from Vincent Mailhol. * Dropped redundant comment from commit description for patch#3. Link: https://patch.msgid.link/[email protected] Signed-off-by: Marc Kleine-Budde <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Jun 11, 2025
…kedges'
Eduard Zingerman says:
====================
bpf: propagate read/precision marks over state graph backedges
Current loop_entry-based states comparison logic does not handle the
following case:
.-> A --. Assume the states are visited in the order A, B, C.
| | | Assume that state B reaches a state equivalent to state A.
| v v At this point, state C is not processed yet, so state A
'-- B C has not received any read or precision marks from C.
As a result, these marks won't be propagated to B.
If B has incomplete marks, it is unsafe to use it in states_equal()
checks. This issue was first reported in [1].
This patch-set
--------------
Here is the gist of the algorithm implemented by this patch-set:
- Compute strongly connected components (SCCs) in the program CFG.
- When a verifier state enters an SCC, that state is recorded as the
SCC's entry point.
- When a verifier state is found to be equivalent to another
(e.g., B to A in the example above), it is recorded as a
states-graph backedge.
- Backedges are accumulated per SCC (*).
- When an SCC entry state reaches `branches == 0`, propagate read and
precision marks through the backedges until a fixed point is reached
(e.g., from A to B, from C to A, and then again from A to B).
(*) This is an oversimplification, see patch #8 for details.
Unfortunately, this means that commit [2] needs to be reverted,
as precision propagation requires access to jump history,
and backedges represent history not belonging to `env->cur_state`.
Details are provided in patch #8; a comment in `is_state_visited()`
explains most of the mechanics.
Patch #2 adds a `compute_scc()` function, which computes SCCs in the
program CFG. This function was tested using property-based testing in
[3], but it is not included in selftests.
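For reference, a self-contained sketch of the classic recursive Tarjan SCC computation over an adjacency-list graph; the in-kernel compute_scc() works on BPF instructions directly, so everything below is illustrative:

#define UNVISITED (-1)

struct scc_ctx {
	int n, timer, nscc;
	const int **succ;	/* succ[v]: successors of v, -1 terminated */
	int *idx, *low, *scc_id;	/* idx[] preinitialized to UNVISITED */
	int *stack, top;
	char *on_stack;
};

static void tarjan(struct scc_ctx *c, int v)
{
	c->idx[v] = c->low[v] = c->timer++;
	c->stack[c->top++] = v;
	c->on_stack[v] = 1;

	for (const int *s = c->succ[v]; *s >= 0; s++) {
		if (c->idx[*s] == UNVISITED) {
			tarjan(c, *s);
			if (c->low[*s] < c->low[v])
				c->low[v] = c->low[*s];
		} else if (c->on_stack[*s] && c->idx[*s] < c->low[v]) {
			c->low[v] = c->idx[*s];
		}
	}

	if (c->low[v] == c->idx[v]) {	/* v roots an SCC; pop its members */
		int w;

		do {
			w = c->stack[--c->top];
			c->on_stack[w] = 0;
			c->scc_id[w] = c->nscc;
		} while (w != v);
		c->nscc++;
	}
}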
Previous attempt
----------------
A previous attempt to fix this is described in [4]:
1. Within the states loop, `states_equal(... RANGE_WITHIN)` ignores
read and precision marks.
2. For states outside the loop, all registers for states within the
loop are marked as read and precise.
This approach led to an 86x regression on the `cond_break1` selftest.
In that test, one loop was followed by another, and a certain variable
was incremented in the second loop. This variable was marked as
precise due to rule (2), which hindered convergence in the first loop.
After some off-list discussion, it was decided that this might be a
typical case and such regressions are undesirable.
This patch-set avoids such eager precision markings.
Alternatives
------------
Another option is to associate a mask of read/written/precise stack
slots with each instruction. This mask can be populated during
verifier states exploration. Upon reaching an `EXIT` instruction or an
equivalent state, the accumulated masks can be used to propagate
read/written/precise bits across the program's control flow graph
using an analysis similar to use-def.
Unfortunately, a naive implementation of this approach [5] results in
a 10x regression in `veristat` for some `sched_ext` programs due to
the inability to express the must-write property. This issue requires
further investigation.
Changes in verification performance
-----------------------------------
There are some veristat regressions when comparing with master using
selftests and sched_ext BPF binaries. The comparison is done using
master from [6] and this patch-set from [7] where memory accounting
logic is added to veristat.
========= selftests: master vs patch-set =========
File Program Insns Peak memory (KiB)
--------------------- ----------------------------------- ----- ----- ---------------- ---- ----- ----------------
bpf_qdisc_fq.bpf.o bpf_fq_dequeue 1187 1645 +458 (+38.58%) 768 1240 +472 (+61.46%)
dynptr_success.bpf.o test_copy_from_user_str_dynptr 208 279 +71 (+34.13%) 512 1024 +512 (+100.00%)
dynptr_success.bpf.o test_copy_from_user_task_str_dynptr 205 263 +58 (+28.29%) 512 1024 +512 (+100.00%)
dynptr_success.bpf.o test_probe_read_kernel_str_dynptr 686 857 +171 (+24.93%) 992 1724 +732 (+73.79%)
dynptr_success.bpf.o test_probe_read_user_str_dynptr 689 860 +171 (+24.82%) 1016 1744 +728 (+71.65%)
iters.bpf.o checkpoint_states_deletion 1211 1216 +5 (+0.41%) 512 1280 +768 (+150.00%)
pyperf600_iter.bpf.o on_event 2591 5929 +3338 (+128.83%) 4744 11176 +6432 (+135.58%)
verifier_gotol.bpf.o gotol_large_imm 40004 40004 +0 (+0.00%) 1024 1536 +512 (+50.00%)
Total progs: 3725
Old success: 2157
New success: 2157
total_insns diff min: 0.00%
total_insns diff max: 128.83%
0 -> value: 0
value -> 0: 0
total_insns abs max old: 837,487
total_insns abs max new: 837,487
0 .. 5 %: 3710
5 .. 15 %: 6
20 .. 30 %: 6
30 .. 40 %: 2
125 .. 130 %: 1
mem_peak diff min: -27.78%
mem_peak diff max: 198.44%
mem_peak abs max old: 269,312 KiB
mem_peak abs max new: 269,312 KiB
-30 .. -20 %: 1
-5 .. 0 %: 18
0 .. 5 %: 3568
5 .. 15 %: 4
15 .. 25 %: 3
45 .. 55 %: 4
60 .. 70 %: 1
70 .. 80 %: 2
100 .. 110 %: 3
135 .. 145 %: 1
150 .. 160 %: 1
195 .. 200 %: 1
========= scx: master vs patch-set =========
Program Insns Peak memory (KiB)
------------------------ ----- ----- --------------- ----- ----- -----------------
arena_topology_node_init 2133 2395 +262 (+12.28%) 768 768 +0 (+0.00%)
chaos_dispatch 2835 2868 +33 (+1.16%) 1972 1720 -252 (-12.78%)
chaos_init 4324 5210 +886 (+20.49%) 2528 3028 +500 (+19.78%)
lavd_cpu_offline 5107 5726 +619 (+12.12%) 4188 6304 +2116 (+50.53%)
lavd_cpu_online 5107 5726 +619 (+12.12%) 4188 6304 +2116 (+50.53%)
lavd_dispatch 41775 47601 +5826 (+13.95%) 6196 29192 +22996 (+371.14%)
lavd_enqueue 20238 24188 +3950 (+19.52%) 22084 42156 +20072 (+90.89%)
lavd_init 6974 7685 +711 (+10.20%) 5428 6928 +1500 (+27.63%)
lavd_select_cpu 22138 26088 +3950 (+17.84%) 24448 43688 +19240 (+78.70%)
layered_dispatch 17847 26581 +8734 (+48.94%) 11728 28740 +17012 (+145.05%)
layered_dump 1891 2098 +207 (+10.95%) 2036 3048 +1012 (+49.71%)
layered_runnable 2606 2634 +28 (+1.07%) 748 1240 +492 (+65.78%)
p2dq_init 3691 4554 +863 (+23.38%) 2016 2528 +512 (+25.40%)
rusty_enqueue 28853 28853 +0 (+0.00%) 2072 1824 -248 (-11.97%)
rusty_init_task 31128 31128 +0 (+0.00%) 2176 2560 +384 (+17.65%)
Total progs: 148
Old success: 135
New success: 135
total_insns diff min: 0.00%
total_insns diff max: 48.94%
0 -> value: 0
value -> 0: 0
total_insns abs max old: 41,775
total_insns abs max new: 47,601
0 .. 5 %: 133
5 .. 15 %: 7
15 .. 25 %: 4
35 .. 45 %: 3
45 .. 50 %: 1
mem_peak diff min: -12.78%
mem_peak diff max: 371.14%
mem_peak abs max old: 24,448 KiB
mem_peak abs max new: 43,688 KiB
-15 .. -5 %: 2
-5 .. 0 %: 2
0 .. 5 %: 129
5 .. 15 %: 1
15 .. 25 %: 2
25 .. 35 %: 2
45 .. 55 %: 3
65 .. 75 %: 1
75 .. 85 %: 1
90 .. 100 %: 1
145 .. 155 %: 1
195 .. 205 %: 1
370 .. 375 %: 1
Changelog
---------
v1: https://lore.kernel.org/bpf/[email protected]/
v1 -> v2:
- Rebase
- added mem_peak statistics (Alexei)
- selftests: fixed comments and removed useless r7 assignments (Yonghong)
v2: https://lore.kernel.org/bpf/[email protected]/
v2 -> v3:
- Rebase
Links
-----
[1] https://lore.kernel.org/bpf/[email protected]/
[2] commit 96a30e4 ("bpf: use common instruction history across all states")
[3] https://github.com/eddyz87/scc-test
[4] https://lore.kernel.org/bpf/[email protected]/
[5] https://github.com/eddyz87/bpf/tree/propagate-read-and-precision-in-cfg
[6] https://github.com/eddyz87/bpf/tree/veristat-memory-accounting
[7] https://github.com/eddyz87/bpf/tree/scc-accumulate-backedges
====================
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Jun 26, 2025
Jann Horn reported a use-after-free in unix_stream_read_generic().
The following sequences reproduce the issue:
$ python3
from socket import *
s1, s2 = socketpair(AF_UNIX, SOCK_STREAM)
s1.send(b'x', MSG_OOB)
s2.recv(1, MSG_OOB) # leave a consumed OOB skb
s1.send(b'y', MSG_OOB)
s2.recv(1, MSG_OOB) # leave a consumed OOB skb
s1.send(b'z', MSG_OOB)
s2.recv(1) # recv 'z' illegally
s2.recv(1, MSG_OOB) # access 'z' skb (use-after-free)
Even though a user reads OOB data, the skb holding the data stays on
the recv queue to mark the OOB boundary and break the next recv().
After the last send() in the scenario above, the sk2's recv queue has
2 leading consumed OOB skbs and 1 real OOB skb.
Then, the following happens during the next recv() without MSG_OOB:
1. unix_stream_read_generic() peeks the first consumed OOB skb
2. manage_oob() returns the next consumed OOB skb
3. unix_stream_read_generic() fetches the next not-yet-consumed OOB skb
4. unix_stream_read_generic() reads and frees the OOB skb
and the last recv(MSG_OOB) triggers the KASAN splat.
Step 3 above occurs because of the SO_PEEK_OFF code, which does not
expect unix_skb_len(skb) to be 0, but this is true for such consumed
OOB skbs.
while (skip >= unix_skb_len(skb)) {
skip -= unix_skb_len(skb);
skb = skb_peek_next(skb, &sk->sk_receive_queue);
...
}
In addition to this use-after-free, there is another issue:
ioctl(SIOCATMARK) does not function properly with consecutive consumed
OOB skbs. So, nothing good comes out of such a situation.
Instead of complicating manage_oob(), ioctl() handling, and the next
ECONNRESET fix by introducing a loop for consecutive consumed OOB skbs,
let's not leave such consecutive skbs on the queue unnecessarily.
Now, while receiving an OOB skb in unix_stream_recv_urg(), if its
previous skb is a consumed OOB skb, it is freed.
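A sketch of that idea (the helper below is hypothetical; the real
change lives inside unix_stream_recv_urg() and must run with the
receive queue lock held):

static void unix_drop_consumed_oob(struct sock *sk,
				   struct sk_buff *oob_skb)
{
	struct sk_buff *skb;

	/* Hypothetical helper: drop leading, already-consumed OOB
	 * skbs (unix_skb_len() == 0) sitting in front of the OOB skb
	 * being received, so consumed OOB markers never accumulate.
	 */
	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL &&
	       skb != oob_skb && unix_skb_len(skb) == 0) {
		__skb_unlink(skb, &sk->sk_receive_queue);
		consume_skb(skb);
	}
}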
[0]:
BUG: KASAN: slab-use-after-free in unix_stream_read_actor (net/unix/af_unix.c:3027)
Read of size 4 at addr ffff888106ef2904 by task python3/315
CPU: 2 UID: 0 PID: 315 Comm: python3 Not tainted 6.16.0-rc1-00407-gec315832f6f9 #8 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-4.fc42 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:122)
print_report (mm/kasan/report.c:409 mm/kasan/report.c:521)
kasan_report (mm/kasan/report.c:636)
unix_stream_read_actor (net/unix/af_unix.c:3027)
unix_stream_read_generic (net/unix/af_unix.c:2708 net/unix/af_unix.c:2847)
unix_stream_recvmsg (net/unix/af_unix.c:3048)
sock_recvmsg (net/socket.c:1063 (discriminator 20) net/socket.c:1085 (discriminator 20))
__sys_recvfrom (net/socket.c:2278)
__x64_sys_recvfrom (net/socket.c:2291 (discriminator 1) net/socket.c:2287 (discriminator 1) net/socket.c:2287 (discriminator 1))
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f8911fcea06
Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
RSP: 002b:00007fffdb0dccb0 EFLAGS: 00000202 ORIG_RAX: 000000000000002d
RAX: ffffffffffffffda RBX: 00007fffdb0dcdc8 RCX: 00007f8911fcea06
RDX: 0000000000000001 RSI: 00007f8911a5e060 RDI: 0000000000000006
RBP: 00007fffdb0dccd0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000202 R12: 00007f89119a7d20
R13: ffffffffc4653600 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Allocated by task 315:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
__kasan_slab_alloc (mm/kasan/common.c:348)
kmem_cache_alloc_node_noprof (./include/linux/kasan.h:250 mm/slub.c:4148 mm/slub.c:4197 mm/slub.c:4249)
__alloc_skb (net/core/skbuff.c:660 (discriminator 4))
alloc_skb_with_frags (./include/linux/skbuff.h:1336 net/core/skbuff.c:6668)
sock_alloc_send_pskb (net/core/sock.c:2993)
unix_stream_sendmsg (./include/net/sock.h:1847 net/unix/af_unix.c:2256 net/unix/af_unix.c:2418)
__sys_sendto (net/socket.c:712 (discriminator 20) net/socket.c:727 (discriminator 20) net/socket.c:2226 (discriminator 20))
__x64_sys_sendto (net/socket.c:2233 (discriminator 1) net/socket.c:2229 (discriminator 1) net/socket.c:2229 (discriminator 1))
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
Freed by task 315:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
kasan_save_free_info (mm/kasan/generic.c:579 (discriminator 1))
__kasan_slab_free (mm/kasan/common.c:271)
kmem_cache_free (mm/slub.c:4643 (discriminator 3) mm/slub.c:4745 (discriminator 3))
unix_stream_read_generic (net/unix/af_unix.c:3010)
unix_stream_recvmsg (net/unix/af_unix.c:3048)
sock_recvmsg (net/socket.c:1063 (discriminator 20) net/socket.c:1085 (discriminator 20))
__sys_recvfrom (net/socket.c:2278)
__x64_sys_recvfrom (net/socket.c:2291 (discriminator 1) net/socket.c:2287 (discriminator 1) net/socket.c:2287 (discriminator 1))
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
The buggy address belongs to the object at ffff888106ef28c0
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 68 bytes inside of
freed 224-byte region [ffff888106ef28c0, ffff888106ef29a0)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888106ef3cc0 pfn:0x106ef2
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x200000000000040(head|node=0|zone=2)
page_type: f5(slab)
raw: 0200000000000040 ffff8881001d28c0 ffffea000422fe00 0000000000000004
raw: ffff888106ef3cc0 0000000080190010 00000000f5000000 0000000000000000
head: 0200000000000040 ffff8881001d28c0 ffffea000422fe00 0000000000000004
head: ffff888106ef3cc0 0000000080190010 00000000f5000000 0000000000000000
head: 0200000000000001 ffffea00041bbc81 00000000ffffffff 00000000ffffffff
head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888106ef2800: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
ffff888106ef2880: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
>ffff888106ef2900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888106ef2980: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
ffff888106ef2a00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 314001f ("af_unix: Add OOB support")
Reported-by: Jann Horn <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Jun 27, 2025
In rtl8187_stop(), move the call to usb_kill_anchored_urbs() before
clearing b_tx_status.queue. This prevents callbacks from using an
already freed skb: previously the anchor was not killed before such
skbs were freed.
BUG: kernel NULL pointer dereference, address: 0000000000000080
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Not tainted 6.15.0 #8 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
RIP: 0010:ieee80211_tx_status_irqsafe+0x21/0xc0 [mac80211]
Call Trace:
<IRQ>
rtl8187_tx_cb+0x116/0x150 [rtl8187]
__usb_hcd_giveback_urb+0x9d/0x120
usb_giveback_urb_bh+0xbb/0x140
process_one_work+0x19b/0x3c0
bh_worker+0x1a7/0x210
tasklet_action+0x10/0x30
handle_softirqs+0xf0/0x340
__irq_exit_rcu+0xcd/0xf0
common_interrupt+0x85/0xa0
</IRQ>
Tested on RTL8187BvE device.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: c1db52b ("rtl8187: Use usb anchor facilities to manage urbs")
Signed-off-by: Daniil Dulov <[email protected]>
Reviewed-by: Ping-Ke Shih <[email protected]>
Signed-off-by: Ping-Ke Shih <[email protected]>
Link: https://patch.msgid.link/[email protected]
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Jul 18, 2025
It is reported that on the Acer Nitro V15 suspend only works properly
if the keyboard backlight is turned off. In looking through the issue,
the Acer Nitro V15 has a GPIO (#8) specified in _AEI, but it has no
matching notify device in _EVT. The values for GPIO #8 change as the
keyboard backlight is turned on and off, which suggests that GPIO #8
is actually meant solely for the keyboard backlight. Turning off the
interrupt for this GPIO fixes the issue. Add a quirk that does just
that.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4169
Signed-off-by: Mario Limonciello <[email protected]>
Acked-by: Mika Westerberg <[email protected]>
Signed-off-by: Andy Shevchenko <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Jul 18, 2025
A crash in conntrack was reported while trying to unlink the conntrack
entry from the hash bucket list:
[exception RIP: __nf_ct_delete_from_lists+172]
[..]
#7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
#8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
#9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
[..]
The nf_conn struct is marked as allocated from slab but appears to be in
a partially initialised state:
ct hlist pointer is garbage; looks like the ct hash value
(hence crash).
ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected
ct->timeout is 30000 (=30s), which is unexpected.
Everything else looks like a normal udp conntrack entry. If we ignore
ct->status and pretend it is 0, the entry matches those that are newly
allocated but not yet inserted into the hash:
- ct hlist pointers are overloaded and store/cache the raw tuple hash
- ct->timeout matches the relative time expected for a new udp flow
rather than the absolute 'jiffies' value.
If it were not for the presence of IPS_CONFIRMED,
__nf_conntrack_find_get() would have skipped the entry.
The theory is that we hit the following race:
   cpu x                 cpu y                  cpu z
 found entry E        found entry E
 E is expired         <preemption>
 nf_ct_delete()
 return E to rcu slab
                                            init_conntrack
                                            E is re-inited,
                                            ct->status set to 0
                                            reply tuplehash hnnode.pprev
                                            stores hash value.

cpu y found E right before it was deleted on cpu x.
E is now re-inited on cpu z. cpu y was preempted before
checking for expiry and/or confirm bit.

                                            ->refcnt set to 1
                                            E now owned by skb
                                            ->timeout set to 30000

If cpu y were to resume now, it would observe E as
expired but would skip E due to missing CONFIRMED bit.

                                            nf_conntrack_confirm gets called
                                            sets: ct->status |= CONFIRMED
                                            This is wrong: E is not yet added
                                            to hashtable.

cpu y resumes, it observes E as expired but CONFIRMED:

                      <resumes>
                      nf_ct_expired()
                       -> yes (ct->timeout is 30s)
                      confirmed bit set.

cpu y will try to delete E from the hashtable:

                      nf_ct_delete() -> set DYING bit
                      __nf_ct_delete_from_lists

Even this scenario doesn't guarantee a crash:
cpu z still holds the table bucket lock(s) so y blocks:

                      wait for spinlock held by z

                                            CONFIRMED is set but there is no
                                            guarantee ct will be added to hash:
                                            "chaintoolong" or "clash resolution"
                                            logic both skip the insert step.

                                            reply hnnode.pprev still stores the
                                            hash value.

                                            unlocks spinlock
                                            return NF_DROP

                      <unblocks, then
                       crashes on hlist_nulls_del_rcu pprev>
In case CPU z does insert the entry into the hashtable, cpu y will unlink
E again right away but no crash occurs.
Without the 'cpu y' race, the 'garbage' hlist is of no consequence:
ct refcnt remains at 1, eventually the skb will be freed and E gets
destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.
To resolve this, move the IPS_CONFIRMED assignment after the table
insertion but before the unlock.
Pablo points out that the confirm-bit store could be reordered to
happen before the hlist add or the timeout fixup, so switch to set_bit
and a before_atomic memory barrier to prevent this.
It doesn't matter if other CPUs can observe a newly inserted entry right
before the CONFIRMED bit was set:
Such event cannot be distinguished from above "E is the old incarnation"
case: the entry will be skipped.
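A minimal sketch of the resulting order in the confirm path
(simplified; chaintoolong/clash-resolution handling and error paths
omitted):

	/* Fix up the timeout and insert first, then publish
	 * IPS_CONFIRMED with a barrier so the bit cannot be
	 * observed before the insertion or the timeout fixup.
	 */
	ct->timeout += nfct_time_stamp;	/* relative -> absolute */
	__nf_conntrack_hash_insert(ct, hash, reply_hash);
	smp_mb__before_atomic();
	set_bit(IPS_CONFIRMED_BIT, &ct->status);
	nf_conntrack_double_unlock(hash, reply_hash);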
Also change nf_ct_should_gc() to first check the confirmed bit.
The gc sequence is:
1. Check if entry has expired, if not skip to next entry
2. Obtain a reference to the expired entry.
3. Call nf_ct_should_gc() to double-check step 1.
nf_ct_should_gc() is thus called only for entries that already failed an
expiry check. After this patch, once the confirmed bit check passes
ct->timeout has been altered to reflect the absolute 'best before' date
instead of a relative time. Step 3 will therefore not remove the entry.
Without this change to nf_ct_should_gc() we could still get this sequence:
1. Check if entry has expired.
2. Obtain a reference.
3. Call nf_ct_should_gc() to double-check step 1:
4 - entry is still observed as expired
5 - meanwhile, ct->timeout is corrected to absolute value on other CPU
and confirm bit gets set
6 - confirm bit is seen
7 - valid entry is removed again
First do check 6), then 4) so the gc expiry check always picks up either
confirmed bit unset (entry gets skipped) or expiry re-check failure for
re-inited conntrack objects.
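Sketched, the reordered check looks as follows (close to the upstream
helper, but treat the exact shape as illustrative):

static inline bool nf_ct_should_gc(const struct nf_conn *ct)
{
	/* Confirmed bit first: unconfirmed (possibly re-inited)
	 * entries are skipped outright; confirmed ones are then
	 * re-checked against the absolute timeout.
	 */
	return nf_ct_is_confirmed(ct) && nf_ct_is_expired(ct) &&
	       !nf_ct_is_dying(ct);
}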
This change cannot be backported to releases before 5.19. Without
commit 8a75a2c ("netfilter: conntrack: remove unconfirmed list"), the
|= IPS_CONFIRMED line cannot be moved without further changes.
Cc: Razvan Cojocaru <[email protected]>
Link: https://lore.kernel.org/netfilter-devel/[email protected]/
Link: https://lore.kernel.org/netfilter-devel/[email protected]/
Fixes: 1397af5 ("netfilter: conntrack: remove the percpu dying list")
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Aug 2, 2025
perf script tests fail with a segmentation fault, as below:
92: perf script tests:
--- start ---
test child forked, pid 103769
DB test
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.012 MB /tmp/perf-test-script.7rbftEpOzX/perf.data (9 samples) ]
/usr/libexec/perf-core/tests/shell/script.sh: line 35:
103780 Segmentation fault (core dumped)
perf script -i "${perfdatafile}" -s "${db_test}"
--- Cleaning up ---
---- end(-1) ----
92: perf script tests : FAILED!
Backtrace pointed to:
#0 0x0000000010247dd0 in maps.machine ()
#1 0x00000000101d178c in db_export.sample ()
#2 0x00000000103412c8 in python_process_event ()
#3 0x000000001004eb28 in process_sample_event ()
#4 0x000000001024fcd0 in machines.deliver_event ()
#5 0x000000001025005c in perf_session.deliver_event ()
#6 0x00000000102568b0 in __ordered_events__flush.part.0 ()
#7 0x0000000010251618 in perf_session.process_events ()
#8 0x0000000010053620 in cmd_script ()
#9 0x00000000100b5a28 in run_builtin ()
#10 0x00000000100b5f94 in handle_internal_command ()
#11 0x0000000010011114 in main ()
Further investigation reveals that this occurs in the perf script
tests because they use the `db_test.py` script. This script sets
`perf_db_export_mode = True`.
With `perf_db_export_mode` enabled, if a sample originates from a
hypervisor, perf doesn't set maps for the "[H]" sample in the code.
Consequently, `al->maps` remains NULL when `maps__machine(al->maps)`
is called from `db_export__sample`.
As al->maps can be NULL for hypervisor samples, use thread->maps
instead, because even for a hypervisor sample the machine should
exist. If there is no machine for some reason, return -1 to avoid the
segmentation fault.
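A sketch of the guard based on the description above (accessor names
as they appear in perf's util code; the exact shape is illustrative):

	/* In db_export__sample() (sketch): hypervisor samples carry
	 * no al->maps, so fall back to the thread's maps; without a
	 * machine there is nothing to export, so bail out with -1.
	 */
	struct maps *maps = al->maps ?: thread__maps(al->thread);
	struct machine *machine = maps ? maps__machine(maps) : NULL;

	if (!machine)
		return -1;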
Reported-by: Disha Goel <[email protected]>
Signed-off-by: Aditya Bodkhe <[email protected]>
Reviewed-by: Adrian Hunter <[email protected]>
Tested-by: Disha Goel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Suggested-by: Adrian Hunter <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Aug 2, 2025
Without the change, `perf` hangs on character devices. On my system
it's enough to run a system-wide sampler for a few seconds to get the
hang:
$ perf record -a -g --call-graph=dwarf
$ perf report
# hung
`strace` shows that the hang happens on a read of the character
device `/dev/dri/renderD128`:
$ strace -y -f -p 2780484
strace: Process 2780484 attached
pread64(101</dev/dri/renderD128>, strace: Process 2780484 detached
Its call trace descends into `elfutils`:
$ gdb -p 2780484
(gdb) bt
#0 0x00007f5e508f04b7 in __libc_pread64 (fd=101, buf=0x7fff9df7edb0, count=0, offset=0)
at ../sysdeps/unix/sysv/linux/pread64.c:25
#1 0x00007f5e52b79515 in read_file () from /<<NIX>>/elfutils-0.192/lib/libelf.so.1
#2 0x00007f5e52b25666 in libdw_open_elf () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#3 0x00007f5e52b25907 in __libdw_open_file () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#4 0x00007f5e52b120a9 in dwfl_report_elf@@ELFUTILS_0.156 ()
from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#5 0x000000000068bf20 in __report_module (al=al@entry=0x7fff9df80010, ip=ip@entry=139803237033216, ui=ui@entry=0x5369b5e0)
at util/dso.h:537
#6 0x000000000068c3d1 in report_module (ip=139803237033216, ui=0x5369b5e0) at util/unwind-libdw.c:114
#7 frame_callback (state=0x535aef10, arg=0x5369b5e0) at util/unwind-libdw.c:242
#8 0x00007f5e52b261d3 in dwfl_thread_getframes () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#9 0x00007f5e52b25bdb in get_one_thread_cb () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#10 0x00007f5e52b25faa in dwfl_getthreads () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#11 0x00007f5e52b26514 in dwfl_getthread_frames () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#12 0x000000000068c6ce in unwind__get_entries (cb=cb@entry=0x5d4620 <unwind_entry>, arg=arg@entry=0x10cd5fa0,
thread=thread@entry=0x1076a290, data=data@entry=0x7fff9df80540, max_stack=max_stack@entry=127,
best_effort=best_effort@entry=false) at util/thread.h:152
#13 0x00000000005dae95 in thread__resolve_callchain_unwind (evsel=0x106006d0, thread=0x1076a290, cursor=0x10cd5fa0,
sample=0x7fff9df80540, max_stack=127, symbols=true) at util/machine.c:2939
#14 thread__resolve_callchain_unwind (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, sample=0x7fff9df80540,
max_stack=127, symbols=true) at util/machine.c:2920
#15 __thread__resolve_callchain (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, evsel@entry=0x7fff9df80440,
sample=0x7fff9df80540, parent=parent@entry=0x7fff9df804a0, root_al=root_al@entry=0x7fff9df80440, max_stack=127, symbols=true)
at util/machine.c:2970
#16 0x00000000005d0cb2 in thread__resolve_callchain (thread=<optimized out>, cursor=<optimized out>, evsel=0x7fff9df80440,
sample=<optimized out>, parent=0x7fff9df804a0, root_al=0x7fff9df80440, max_stack=127) at util/machine.h:198
#17 sample__resolve_callchain (sample=<optimized out>, cursor=<optimized out>, parent=parent@entry=0x7fff9df804a0,
evsel=evsel@entry=0x106006d0, al=al@entry=0x7fff9df80440, max_stack=max_stack@entry=127) at util/callchain.c:1127
#18 0x0000000000617e08 in hist_entry_iter__add (iter=iter@entry=0x7fff9df80480, al=al@entry=0x7fff9df80440, max_stack_depth=127,
arg=arg@entry=0x7fff9df81ae0) at util/hist.c:1255
#19 0x000000000045d2d0 in process_sample_event (tool=0x7fff9df81ae0, event=<optimized out>, sample=0x7fff9df80540,
evsel=0x106006d0, machine=<optimized out>) at builtin-report.c:334
#20 0x00000000005e3bb1 in perf_session__deliver_event (session=0x105ff2c0, event=0x7f5c7d735ca0, tool=0x7fff9df81ae0,
file_offset=2914716832, file_path=0x105ffbf0 "perf.data") at util/session.c:1367
#21 0x00000000005e8d93 in do_flush (oe=0x105ffa50, show_progress=false) at util/ordered-events.c:245
#22 __ordered_events__flush (oe=0x105ffa50, how=OE_FLUSH__ROUND, timestamp=<optimized out>) at util/ordered-events.c:324
#23 0x00000000005e1f64 in perf_session__process_user_event (session=0x105ff2c0, event=0x7f5c7d752b18, file_offset=2914835224,
file_path=0x105ffbf0 "perf.data") at util/session.c:1419
#24 0x00000000005e47c7 in reader__read_event (rd=rd@entry=0x7fff9df81260, session=session@entry=0x105ff2c0,
--Type <RET> for more, q to quit, c to continue without paging--
quit
prog=prog@entry=0x7fff9df81220) at util/session.c:2132
#25 0x00000000005e4b37 in reader__process_events (rd=0x7fff9df81260, session=0x105ff2c0, prog=0x7fff9df81220)
at util/session.c:2181
#26 __perf_session__process_events (session=0x105ff2c0) at util/session.c:2226
#27 perf_session__process_events (session=session@entry=0x105ff2c0) at util/session.c:2390
#28 0x0000000000460add in __cmd_report (rep=0x7fff9df81ae0) at builtin-report.c:1076
#29 cmd_report (argc=<optimized out>, argv=<optimized out>) at builtin-report.c:1827
#30 0x00000000004c5a40 in run_builtin (p=p@entry=0xd8f7f8 <commands+312>, argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0)
at perf.c:351
#31 0x00000000004c5d63 in handle_internal_command (argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0) at perf.c:404
#32 0x0000000000442de3 in run_argv (argcp=<synthetic pointer>, argv=<synthetic pointer>) at perf.c:448
#33 main (argc=<optimized out>, argv=0x7fff9df844b0) at perf.c:556
The hang happens because nothing in `perf` or `elfutils` checks
whether a mapped file is safely readable.
The change conservatively skips all non-regular files.
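A minimal illustration of such a check (not the perf patch itself,
which applies the test while processing the mapped files):

#include <stdbool.h>
#include <sys/stat.h>

/* Conservative: only S_ISREG files are safe to pread() for
 * unwinding; character devices like /dev/dri/renderD128 may block
 * forever.
 */
static bool is_regular_file(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0 && S_ISREG(st.st_mode);
}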
Signed-off-by: Sergei Trofimovich <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Aug 2, 2025
Symbolize stack traces by creating a live machine. Add this
functionality to dump_stack and switch dump_stack users to use
it. Switch TUI to use it. Add stack traces to the child test function
which can be useful to diagnose blocked code.
Example output:
```
$ perf test -vv PERF_RECORD_
...
7: PERF_RECORD_* events & perf_sample fields:
7: PERF_RECORD_* events & perf_sample fields : Running (1 active)
^C
Signal (2) while running tests.
Terminating tests with the same signal
Internal test harness failure. Completing any started tests:
: 7: PERF_RECORD_* events & perf_sample fields:
---- unexpected signal (2) ----
#0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0
#1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
#2 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64
#3 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72
#4 0x7fc12fef1393 in __nanosleep nanosleep.c:26
#5 0x7fc12ff02d68 in __sleep sleep.c:55
#6 0x55788c63196b in test__PERF_RECORD perf-record.c:0
#7 0x55788c620fb0 in run_test_child builtin-test.c:0
#8 0x55788c5bd18d in start_command run-command.c:127
#9 0x55788c621ef3 in __cmd_test builtin-test.c:0
#10 0x55788c6225bf in cmd_test ??:0
#11 0x55788c5afbd0 in run_builtin perf.c:0
#12 0x55788c5afeeb in handle_internal_command perf.c:0
#13 0x55788c52b383 in main ??:0
#14 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74
#15 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
#16 0x55788c52b9d1 in _start ??:0
---- unexpected signal (2) ----
#0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0
#1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
#2 0x7fc12fea3a14 in pthread_sigmask@GLIBC_2.2.5 pthread_sigmask.c:45
#3 0x7fc12fe49fd9 in __GI___sigprocmask sigprocmask.c:26
#4 0x7fc12ff2601b in __longjmp_chk longjmp.c:36
#5 0x55788c6210c0 in print_test_result.isra.0 builtin-test.c:0
#6 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
#7 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64
#8 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72
#9 0x7fc12fef1393 in __nanosleep nanosleep.c:26
#10 0x7fc12ff02d68 in __sleep sleep.c:55
#11 0x55788c63196b in test__PERF_RECORD perf-record.c:0
#12 0x55788c620fb0 in run_test_child builtin-test.c:0
#13 0x55788c5bd18d in start_command run-command.c:127
#14 0x55788c621ef3 in __cmd_test builtin-test.c:0
#15 0x55788c6225bf in cmd_test ??:0
#16 0x55788c5afbd0 in run_builtin perf.c:0
#17 0x55788c5afeeb in handle_internal_command perf.c:0
#18 0x55788c52b383 in main ??:0
#19 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74
#20 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
#21 0x55788c52b9d1 in _start ??:0
7: PERF_RECORD_* events & perf_sample fields : Skip (permissions)
```
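For comparison, the same idea can be demonstrated outside perf with
glibc's backtrace facilities; this is only a generic illustration of
dumping a symbolized trace from a signal handler (build with
-rdynamic for symbol names), not perf's live-machine implementation:

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

/* backtrace_symbols_fd() writes straight to the fd, avoiding
 * malloc() inside the signal handler. */
static void sig_handler(int sig)
{
	void *frames[32];
	int n = backtrace(frames, 32);

	backtrace_symbols_fd(frames, n, STDERR_FILENO);
	_exit(128 + sig);
}

int main(void)
{
	signal(SIGINT, sig_handler);
	for (;;)
		pause();	/* Ctrl-C to print the trace */
}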
Signed-off-by: Ian Rogers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Aug 2, 2025
Calling perf top with branch filters enabled on Intel CPUs
with branch counters logging (a.k.a. LBR event logging [1]) support
results in a segfault.
$ perf top -e '{cpu_core/cpu-cycles/,cpu_core/event=0xc6,umask=0x3,frontend=0x11,name=frontend_retired_dsb_miss/}' -j any,counter
...
Thread 27 "perf" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffafff76c0 (LWP 949003)]
perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653
653 *width = env->cpu_pmu_caps ? env->br_cntr_width :
(gdb) bt
#0 perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653
#1 0x00000000005b1599 in symbol__account_br_cntr (branch=0x7fffcc3db580, evsel=0xfea2d0, offset=12, br_cntr=8) at util/annotate.c:345
#2 0x00000000005b17fb in symbol__account_cycles (addr=5658172, start=5658160, sym=0x7fffcc0ee420, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:389
#3 0x00000000005b1976 in addr_map_symbol__account_cycles (ams=0x7fffcd7b01d0, start=0x7fffcd7b02b0, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:422
#4 0x000000000068d57f in hist__account_cycles (bs=0x110d288, al=0x7fffafff6540, sample=0x7fffafff6760, nonany_branch_mode=false, total_cycles=0x0, evsel=0xfea2d0) at util/hist.c:2850
#5 0x0000000000446216 in hist_iter__top_callback (iter=0x7fffafff6590, al=0x7fffafff6540, single=true, arg=0x7fffffff9e00) at builtin-top.c:737
#6 0x0000000000689787 in hist_entry_iter__add (iter=0x7fffafff6590, al=0x7fffafff6540, max_stack_depth=127, arg=0x7fffffff9e00) at util/hist.c:1359
#7 0x0000000000446710 in perf_event__process_sample (tool=0x7fffffff9e00, event=0x110d250, evsel=0xfea2d0, sample=0x7fffafff6760, machine=0x108c968) at builtin-top.c:845
#8 0x0000000000447735 in deliver_event (qe=0x7fffffffa120, qevent=0x10fc200) at builtin-top.c:1211
#9 0x000000000064ccae in do_flush (oe=0x7fffffffa120, show_progress=false) at util/ordered-events.c:245
#10 0x000000000064d005 in __ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP, timestamp=0) at util/ordered-events.c:324
#11 0x000000000064d0ef in ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP) at util/ordered-events.c:342
#12 0x00000000004472a9 in process_thread (arg=0x7fffffff9e00) at builtin-top.c:1120
#13 0x00007ffff6e7dba8 in start_thread (arg=<optimized out>) at pthread_create.c:448
#14 0x00007ffff6f01b8c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
The cause is that perf_env__find_br_cntr_info tries to dereference
the NULL pmu_caps pointer in the perf_env struct. A similar issue
exists on homogeneous core systems, which use the cpu_pmu_caps
structure.
Fix this by populating the cpu_pmu_caps and pmu_caps structures with
values from sysfs when calling perf top with branch stack sampling
enabled.
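A hedged sketch of where the fix slots in; the helper name below is
made up for illustration, not an actual perf function:

	/* In perf top setup (sketch): when branch stack sampling is
	 * requested, populate the PMU capability tables up front so
	 * perf_env__find_br_cntr_info() never sees a NULL caps
	 * pointer. perf_env__read_pmu_caps_from_sysfs() is a
	 * hypothetical name. */
	if (opts->branch_stack &&
	    perf_env__read_pmu_caps_from_sysfs(&perf_env) < 0)
		pr_warning("failed to read PMU caps from sysfs\n");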
[1], LBR event logging introduced here:
https://lore.kernel.org/all/[email protected]/
Reviewed-by: Ian Rogers <[email protected]>
Signed-off-by: Thomas Falcon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Aug 9, 2025
syzbot [1] reported the following:
R10: 0000000000000100 R11: 0000000000000206 R12: 00007ffe17473450
R13: 00007f28b1c10854 R14: 000000000000dae5 R15: 00007ffe17474520
</TASK>
---[ end trace 0000000000000000 ]---
==================================================================
BUG: KASAN: use-after-free in __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62
Read of size 8 at addr ffff88812d962278 by task syz-executor/564
CPU: 1 PID: 564 Comm: syz-executor Tainted: G W 6.1.129-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
<TASK>
__dump_stack+0x21/0x24 lib/dump_stack.c:88
dump_stack_lvl+0xee/0x158 lib/dump_stack.c:106
print_address_description+0x71/0x210 mm/kasan/report.c:316
print_report+0x4a/0x60 mm/kasan/report.c:427
kasan_report+0x122/0x150 mm/kasan/report.c:531
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report_generic.c:351
__list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62
__list_del_entry include/linux/list.h:134 [inline]
list_del_init include/linux/list.h:206 [inline]
f2fs_inode_synced+0xf7/0x2e0 fs/f2fs/super.c:1531
f2fs_update_inode+0x74/0x1c40 fs/f2fs/inode.c:585
f2fs_update_inode_page+0x137/0x170 fs/f2fs/inode.c:703
f2fs_write_inode+0x4ec/0x770 fs/f2fs/inode.c:731
write_inode fs/fs-writeback.c:1460 [inline]
__writeback_single_inode+0x4a0/0xab0 fs/fs-writeback.c:1677
writeback_single_inode+0x221/0x8b0 fs/fs-writeback.c:1733
sync_inode_metadata+0xb6/0x110 fs/fs-writeback.c:2789
f2fs_sync_inode_meta+0x16d/0x2a0 fs/f2fs/checkpoint.c:1159
block_operations fs/f2fs/checkpoint.c:1269 [inline]
f2fs_write_checkpoint+0xca3/0x2100 fs/f2fs/checkpoint.c:1658
kill_f2fs_super+0x231/0x390 fs/f2fs/super.c:4668
deactivate_locked_super+0x98/0x100 fs/super.c:332
deactivate_super+0xaf/0xe0 fs/super.c:363
cleanup_mnt+0x45f/0x4e0 fs/namespace.c:1186
__cleanup_mnt+0x19/0x20 fs/namespace.c:1193
task_work_run+0x1c6/0x230 kernel/task_work.c:203
exit_task_work include/linux/task_work.h:39 [inline]
do_exit+0x9fb/0x2410 kernel/exit.c:871
do_group_exit+0x210/0x2d0 kernel/exit.c:1021
__do_sys_exit_group kernel/exit.c:1032 [inline]
__se_sys_exit_group kernel/exit.c:1030 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1030
x64_sys_call+0x7b4/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x68/0xd2
RIP: 0033:0x7f28b1b8e169
Code: Unable to access opcode bytes at 0x7f28b1b8e13f.
RSP: 002b:00007ffe174710a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f28b1c10879 RCX: 00007f28b1b8e169
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 0000000000000002 R08: 00007ffe1746ee47 R09: 00007ffe17472360
R10: 0000000000000009 R11: 0000000000000246 R12: 00007ffe17472360
R13: 00007f28b1c10854 R14: 000000000000dae5 R15: 00007ffe17474520
</TASK>
Allocated by task 569:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4b/0x70 mm/kasan/common.c:52
kasan_save_alloc_info+0x25/0x30 mm/kasan/generic.c:505
__kasan_slab_alloc+0x72/0x80 mm/kasan/common.c:328
kasan_slab_alloc include/linux/kasan.h:201 [inline]
slab_post_alloc_hook+0x4f/0x2c0 mm/slab.h:737
slab_alloc_node mm/slub.c:3398 [inline]
slab_alloc mm/slub.c:3406 [inline]
__kmem_cache_alloc_lru mm/slub.c:3413 [inline]
kmem_cache_alloc_lru+0x104/0x220 mm/slub.c:3429
alloc_inode_sb include/linux/fs.h:3245 [inline]
f2fs_alloc_inode+0x2d/0x340 fs/f2fs/super.c:1419
alloc_inode fs/inode.c:261 [inline]
iget_locked+0x186/0x880 fs/inode.c:1373
f2fs_iget+0x55/0x4c60 fs/f2fs/inode.c:483
f2fs_lookup+0x366/0xab0 fs/f2fs/namei.c:487
__lookup_slow+0x2a3/0x3d0 fs/namei.c:1690
lookup_slow+0x57/0x70 fs/namei.c:1707
walk_component+0x2e6/0x410 fs/namei.c:1998
lookup_last fs/namei.c:2455 [inline]
path_lookupat+0x180/0x490 fs/namei.c:2479
filename_lookup+0x1f0/0x500 fs/namei.c:2508
vfs_statx+0x10b/0x660 fs/stat.c:229
vfs_fstatat fs/stat.c:267 [inline]
vfs_lstat include/linux/fs.h:3424 [inline]
__do_sys_newlstat fs/stat.c:423 [inline]
__se_sys_newlstat+0xd5/0x350 fs/stat.c:417
__x64_sys_newlstat+0x5b/0x70 fs/stat.c:417
x64_sys_call+0x393/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:7
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x68/0xd2
Freed by task 13:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4b/0x70 mm/kasan/common.c:52
kasan_save_free_info+0x31/0x50 mm/kasan/generic.c:516
____kasan_slab_free+0x132/0x180 mm/kasan/common.c:236
__kasan_slab_free+0x11/0x20 mm/kasan/common.c:244
kasan_slab_free include/linux/kasan.h:177 [inline]
slab_free_hook mm/slub.c:1724 [inline]
slab_free_freelist_hook+0xc2/0x190 mm/slub.c:1750
slab_free mm/slub.c:3661 [inline]
kmem_cache_free+0x12d/0x2a0 mm/slub.c:3683
f2fs_free_inode+0x24/0x30 fs/f2fs/super.c:1562
i_callback+0x4c/0x70 fs/inode.c:250
rcu_do_batch+0x503/0xb80 kernel/rcu/tree.c:2297
rcu_core+0x5a2/0xe70 kernel/rcu/tree.c:2557
rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2574
handle_softirqs+0x178/0x500 kernel/softirq.c:578
run_ksoftirqd+0x28/0x30 kernel/softirq.c:945
smpboot_thread_fn+0x45a/0x8c0 kernel/smpboot.c:164
kthread+0x270/0x310 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
Last potentially related work creation:
kasan_save_stack+0x3a/0x60 mm/kasan/common.c:45
__kasan_record_aux_stack+0xb6/0xc0 mm/kasan/generic.c:486
kasan_record_aux_stack_noalloc+0xb/0x10 mm/kasan/generic.c:496
call_rcu+0xd4/0xf70 kernel/rcu/tree.c:2845
destroy_inode fs/inode.c:316 [inline]
evict+0x7da/0x870 fs/inode.c:720
iput_final fs/inode.c:1834 [inline]
iput+0x62b/0x830 fs/inode.c:1860
do_unlinkat+0x356/0x540 fs/namei.c:4397
__do_sys_unlink fs/namei.c:4438 [inline]
__se_sys_unlink fs/namei.c:4436 [inline]
__x64_sys_unlink+0x49/0x50 fs/namei.c:4436
x64_sys_call+0x958/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:88
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x68/0xd2
The buggy address belongs to the object at ffff88812d961f20
which belongs to the cache f2fs_inode_cache of size 1200
The buggy address is located 856 bytes inside of
1200-byte region [ffff88812d961f20, ffff88812d9623d0)
The buggy address belongs to the physical page:
page:ffffea0004b65800 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12d960
head:ffffea0004b65800 order:2 compound_mapcount:0 compound_pincount:0
flags: 0x4000000000010200(slab|head|zone=1)
raw: 4000000000010200 0000000000000000 dead000000000122 ffff88810a94c500
raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 2, migratetype Reclaimable, gfp_mask 0x1d2050(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_RECLAIMABLE), pid 569, tgid 568 (syz.2.16), ts 55943246141, free_ts 0
set_page_owner include/linux/page_owner.h:31 [inline]
post_alloc_hook+0x1d0/0x1f0 mm/page_alloc.c:2532
prep_new_page mm/page_alloc.c:2539 [inline]
get_page_from_freelist+0x2e63/0x2ef0 mm/page_alloc.c:4328
__alloc_pages+0x235/0x4b0 mm/page_alloc.c:5605
alloc_slab_page include/linux/gfp.h:-1 [inline]
allocate_slab mm/slub.c:1939 [inline]
new_slab+0xec/0x4b0 mm/slub.c:1992
___slab_alloc+0x6f6/0xb50 mm/slub.c:3180
__slab_alloc+0x5e/0xa0 mm/slub.c:3279
slab_alloc_node mm/slub.c:3364 [inline]
slab_alloc mm/slub.c:3406 [inline]
__kmem_cache_alloc_lru mm/slub.c:3413 [inline]
kmem_cache_alloc_lru+0x13f/0x220 mm/slub.c:3429
alloc_inode_sb include/linux/fs.h:3245 [inline]
f2fs_alloc_inode+0x2d/0x340 fs/f2fs/super.c:1419
alloc_inode fs/inode.c:261 [inline]
iget_locked+0x186/0x880 fs/inode.c:1373
f2fs_iget+0x55/0x4c60 fs/f2fs/inode.c:483
f2fs_fill_super+0x3ad7/0x6bb0 fs/f2fs/super.c:4293
mount_bdev+0x2ae/0x3e0 fs/super.c:1443
f2fs_mount+0x34/0x40 fs/f2fs/super.c:4642
legacy_get_tree+0xea/0x190 fs/fs_context.c:632
vfs_get_tree+0x89/0x260 fs/super.c:1573
do_new_mount+0x25a/0xa20 fs/namespace.c:3056
page_owner free stack trace missing
Memory state around the buggy address:
ffff88812d962100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88812d962180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88812d962200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88812d962280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88812d962300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
[1] https://syzkaller.appspot.com/x/report.txt?x=13448368580000
This bug can be reproduced with the reproducer [2]. Once we enable
the CONFIG_F2FS_CHECK_FS config, the reproducer triggers the panic
below, so the direct cause of this bug is the same as the one fixed
by patch [3].
kernel BUG at fs/f2fs/inode.c:857!
RIP: 0010:f2fs_evict_inode+0x1204/0x1a20
Call Trace:
<TASK>
evict+0x32a/0x7a0
do_unlinkat+0x37b/0x5b0
__x64_sys_unlink+0xad/0x100
do_syscall_64+0x5a/0xb0
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0010:f2fs_evict_inode+0x1204/0x1a20
[2] https://syzkaller.appspot.com/x/repro.c?x=17495ccc580000
[3] https://lore.kernel.org/linux-f2fs-devel/[email protected]
Tracepoints before panic:
f2fs_unlink_enter: dev = (7,0), dir ino = 3, i_size = 4096, i_blocks = 8, name = file1
f2fs_unlink_exit: dev = (7,0), ino = 7, ret = 0
f2fs_evict_inode: dev = (7,0), ino = 7, pino = 3, i_mode = 0x81ed, i_size = 10, i_nlink = 0, i_blocks = 0, i_advise = 0x0
f2fs_truncate_node: dev = (7,0), ino = 7, nid = 8, block_address = 0x3c05
f2fs_unlink_enter: dev = (7,0), dir ino = 3, i_size = 4096, i_blocks = 8, name = file3
f2fs_unlink_exit: dev = (7,0), ino = 8, ret = 0
f2fs_evict_inode: dev = (7,0), ino = 8, pino = 3, i_mode = 0x81ed, i_size = 9000, i_nlink = 0, i_blocks = 24, i_advise = 0x4
f2fs_truncate: dev = (7,0), ino = 8, pino = 3, i_mode = 0x81ed, i_size = 0, i_nlink = 0, i_blocks = 24, i_advise = 0x4
f2fs_truncate_blocks_enter: dev = (7,0), ino = 8, i_size = 0, i_blocks = 24, start file offset = 0
f2fs_truncate_blocks_exit: dev = (7,0), ino = 8, ret = -2
The root cause is that in the fuzzed image, dnode #8 belongs to inode
#7; after inode #7 is evicted, dnode #8 is dropped.
However, there is a dirent that has ino #8, so once we unlink file3,
in f2fs_evict_inode(), both f2fs_truncate() and
f2fs_update_inode_page() fail because node #8 can not be loaded, and
as a result we miss calling f2fs_inode_synced() to clear the inode's
dirty status.
Let's fix this by calling f2fs_inode_synced() in the error path of
f2fs_evict_inode().
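The shape of the fix, sketched from the description above (not the
literal diff):

	/* In f2fs_evict_inode()'s error path (sketch): if the inode
	 * page cannot be updated (e.g. its node block is
	 * unreachable), still mark the inode as synced so it is
	 * unlinked from the dirty list before being freed. */
	if (err)
		f2fs_inode_synced(inode);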
PS: As I verified, the reproducer [2] can trigger this bug in v6.1.129,
but it fails to in v6.16-rc4; this is because the testcase stops once
other corruption has been detected by f2fs:
F2FS-fs (loop0): inconsistent node block, node_type:2, nid:8, node_footer[nid:8,ino:8,ofs:0,cpver:5013063228981249506,blkaddr:15366]
F2FS-fs (loop0): f2fs_lookup: inode (ino=9) has zero i_nlink
Fixes: 0f18b46 ("f2fs: flush inode metadata when checkpoint is doing")
Closes: https://syzkaller.appspot.com/x/report.txt?x=13448368580000
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Aug 9, 2025
Patch series "extend hung task blocker tracking to rwsems". Inspired by mutex blocker tracking[1], and having already extended it to semaphores, let's now add support for reader-writer semaphores (rwsems). The approach is simple: when a task enters TASK_UNINTERRUPTIBLE while waiting for an rwsem, we just call hung_task_set_blocker(). The hung task detector can then query the rwsem's owner to identify the lock holder. Tracking works reliably for writers, as there can only be a single writer holding the lock, and its task struct is stored in the owner field. The main challenge lies with readers. The owner field points to only one of many concurrent readers, so we might lose track of the blocker if that specific reader unlocks, even while others remain. This is not a significant issue, however. In practice, long-lasting lock contention is almost always caused by a writer. Therefore, reliably tracking the writer is the primary goal of this patch series ;) With this change, the hung task detector can now show blocker task's info like below: [Fri Jun 27 15:21:34 2025] INFO: task cat:28631 blocked for more than 122 seconds. [Fri Jun 27 15:21:34 2025] Tainted: G S 6.16.0-rc3 #8 [Fri Jun 27 15:21:34 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Fri Jun 27 15:21:34 2025] task:cat state:D stack:0 pid:28631 tgid:28631 ppid:28501 task_flags:0x400000 flags:0x00004000 [Fri Jun 27 15:21:34 2025] Call Trace: [Fri Jun 27 15:21:34 2025] <TASK> [Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930 [Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? policy_nodemask+0x215/0x340 [Fri Jun 27 15:21:34 2025] ? _raw_spin_lock_irq+0x8a/0xe0 [Fri Jun 27 15:21:34 2025] ? __pfx__raw_spin_lock_irq+0x10/0x10 [Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180 [Fri Jun 27 15:21:34 2025] schedule_preempt_disabled+0x15/0x30 [Fri Jun 27 15:21:34 2025] rwsem_down_read_slowpath+0x55e/0xe10 [Fri Jun 27 15:21:34 2025] ? __pfx_rwsem_down_read_slowpath+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? __pfx___might_resched+0x10/0x10 [Fri Jun 27 15:21:34 2025] down_read+0xc9/0x230 [Fri Jun 27 15:21:34 2025] ? __pfx_down_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? __debugfs_file_get+0x14d/0x700 [Fri Jun 27 15:21:34 2025] ? __pfx___debugfs_file_get+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? handle_pte_fault+0x52a/0x710 [Fri Jun 27 15:21:34 2025] ? selinux_file_permission+0x3a9/0x590 [Fri Jun 27 15:21:34 2025] read_dummy_rwsem_read+0x4a/0x90 [Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0 [Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410 [Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50 [Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0 [Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0 [Fri Jun 27 15:21:34 2025] ? 
__pfx_ksys_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0 [Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f3f8faefb40 [Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffdeda5ab98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f3f8faefb40 [Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 00000000010fa000 RDI: 0000000000000003 [Fri Jun 27 15:21:34 2025] RBP: 00000000010fa000 R08: 0000000000000000 R09: 0000000000010fff [Fri Jun 27 15:21:34 2025] R10: 00007ffdeda59fe0 R11: 0000000000000246 R12: 00000000010fa000 [Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff [Fri Jun 27 15:21:34 2025] </TASK> [Fri Jun 27 15:21:34 2025] INFO: task cat:28631 <reader> blocked on an rw-semaphore likely owned by task cat:28630 <writer> [Fri Jun 27 15:21:34 2025] task:cat state:S stack:0 pid:28630 tgid:28630 ppid:28501 task_flags:0x400000 flags:0x00004000 [Fri Jun 27 15:21:34 2025] Call Trace: [Fri Jun 27 15:21:34 2025] <TASK> [Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930 [Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? __mod_timer+0x304/0xa80 [Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180 [Fri Jun 27 15:21:34 2025] schedule_timeout+0xfb/0x230 [Fri Jun 27 15:21:34 2025] ? __pfx_schedule_timeout+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? __pfx_process_timeout+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? down_write+0xc4/0x140 [Fri Jun 27 15:21:34 2025] msleep_interruptible+0xbe/0x150 [Fri Jun 27 15:21:34 2025] read_dummy_rwsem_write+0x54/0x90 [Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0 [Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410 [Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50 [Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0 [Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0 [Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10 [Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0 [Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f8f288efb40 [Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffffb631038 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f8f288efb40 [Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 000000002a4b5000 RDI: 0000000000000003 [Fri Jun 27 15:21:34 2025] RBP: 000000002a4b5000 R08: 0000000000000000 R09: 0000000000010fff [Fri Jun 27 15:21:34 2025] R10: 00007ffffb630460 R11: 0000000000000246 R12: 000000002a4b5000 [Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff [Fri Jun 27 15:21:34 2025] </TASK> This patch (of 3): In preparation for extending blocker tracking to support rwsems, make the rwsem_owner() and is_rwsem_reader_owned() helpers globally available for determining if the blocker is a writer or one of the readers. Additionally, a stale owner pointer in a reader-owned rwsem can lead to false positives in blocker tracking when CONFIG_DETECT_HUNG_TASK_BLOCKER is enabled. To mitigate this, clear the owner field on the reader unlock path, similar to what CONFIG_DEBUG_RWSEMS does. A NULL owner is better than a stale one for diagnostics. 
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/ [1] Signed-off-by: Lance Yang <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joel Granados <[email protected]> Cc: John Stultz <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Mingzhe Yang <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Tomasz Figa <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yongliang Gao <[email protected]> Cc: Zi Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Aug 9, 2025
Inspired by mutex blocker tracking[1], and having already extended it
to semaphores, let's now add support for reader-writer semaphores
(rwsems). The approach is simple: when a task enters
TASK_UNINTERRUPTIBLE while waiting for an rwsem, we just call
hung_task_set_blocker(). The hung task detector can then query the
rwsem's owner to identify the lock holder.

Tracking works reliably for writers, as there can only be a single
writer holding the lock, and its task struct is stored in the owner
field.

The main challenge lies with readers. The owner field points to only
one of many concurrent readers, so we might lose track of the blocker
if that specific reader unlocks, even while others remain. This is
not a significant issue, however. In practice, long-lasting lock
contention is almost always caused by a writer. Therefore, reliably
tracking the writer is the primary goal of this patch series ;)

With this change, the hung task detector can now show the blocker
task's info like below:

[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 blocked for more than 122 seconds.
[Fri Jun 27 15:21:34 2025] Tainted: G S 6.16.0-rc3 #8
[Fri Jun 27 15:21:34 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jun 27 15:21:34 2025] task:cat state:D stack:0 pid:28631 tgid:28631 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? policy_nodemask+0x215/0x340
[Fri Jun 27 15:21:34 2025] ? _raw_spin_lock_irq+0x8a/0xe0
[Fri Jun 27 15:21:34 2025] ? __pfx__raw_spin_lock_irq+0x10/0x10
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_preempt_disabled+0x15/0x30
[Fri Jun 27 15:21:34 2025] rwsem_down_read_slowpath+0x55e/0xe10
[Fri Jun 27 15:21:34 2025] ? __pfx_rwsem_down_read_slowpath+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx___might_resched+0x10/0x10
[Fri Jun 27 15:21:34 2025] down_read+0xc9/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_down_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __debugfs_file_get+0x14d/0x700
[Fri Jun 27 15:21:34 2025] ? __pfx___debugfs_file_get+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? handle_pte_fault+0x52a/0x710
[Fri Jun 27 15:21:34 2025] ? selinux_file_permission+0x3a9/0x590
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_read+0x4a/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f3f8faefb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffdeda5ab98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f3f8faefb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 00000000010fa000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 00000000010fa000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffdeda59fe0 R11: 0000000000000246 R12: 00000000010fa000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 <reader> blocked on an rw-semaphore likely owned by task cat:28630 <writer>
[Fri Jun 27 15:21:34 2025] task:cat state:S stack:0 pid:28630 tgid:28630 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __mod_timer+0x304/0xa80
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_timeout+0xfb/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_schedule_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx_process_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? down_write+0xc4/0x140
[Fri Jun 27 15:21:34 2025] msleep_interruptible+0xbe/0x150
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_write+0x54/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f8f288efb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffffb631038 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f8f288efb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 000000002a4b5000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 000000002a4b5000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffffb630460 R11: 0000000000000246 R12: 000000002a4b5000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>

[1] https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lance Yang <[email protected]>
Suggested-by: Masami Hiramatsu (Google) <[email protected]>
Reviewed-by: Masami Hiramatsu (Google) <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joel Granados <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Mingzhe Yang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tomasz Figa <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yongliang Gao <[email protected]>
Cc: Zi Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Sep 19, 2025
…tack-analysis'

Eduard Zingerman says:

====================
bpf: replace path-sensitive with path-insensitive live stack analysis

Consider the following program, assuming checkpoint is created for a
state at instruction (3):

1: call bpf_get_prandom_u32()
2: *(u64 *)(r10 - 8) = 42
   -- checkpoint #1 --
3: if r0 != 0 goto +1
4: exit;
5: r0 = *(u64 *)(r10 - 8)
6: exit

The verifier processes this program by exploring two paths:
- 1 -> 2 -> 3 -> 4
- 1 -> 2 -> 3 -> 5 -> 6

When instruction (5) is processed, the current liveness tracking
mechanism moves up the register parent links and records a "read" mark
for stack slot -8 at checkpoint #1, stopping because of the "write"
mark recorded at instruction (2).

This patch set replaces the existing liveness tracking mechanism with
a path-insensitive data flow analysis. The program above is processed
as follows:
- a data structure representing live stack slots for instructions 1-6
  in frame #0 is allocated;
- when instruction (2) is processed, record that slot -8 is written at
  instruction (2) in frame #0;
- when instruction (5) is processed, record that slot -8 is read at
  instruction (5) in frame #0;
- when instruction (6) is processed, propagate read mark for slot -8
  up the control flow graph to instructions 3 and 2.

The key difference is that the new mechanism operates on a control
flow graph and associates read and write marks with pairs of
(call chain, instruction index). In contrast, the old mechanism
operates on verifier states and register parent links, associating
read and write marks with verifier states.

Motivation
==========

As it stands, this patch set makes liveness tracking slightly less
precise, as it no longer distinguishes individual program paths taken
by the verifier during symbolic execution. See the "Impact on
verification performance" section for details.

However, this change is intended as a stepping stone toward the
following goals:
- Short term, integrate precision tracking into liveness analysis and
  remove the following code:
  - verifier backedge states accumulation in is_state_visited();
  - most of the logic for precision tracking;
  - jump history tracking.
- Long term, help with more efficient loop verification handling.

Why integrate precision tracking?
---------------------------------

In a sense, precision tracking is very similar to liveness tracking.
The data flow equations for liveness tracking look as follows:

  live_after =
    U [state[s].live_before for s in insn_successors(i)]

  state[i].live_before =
    (live_after / state[i].must_write) U state[i].may_read

While data flow equations for precision tracking look as follows:

  precise_after =
    U [state[s].precise_before for s in insn_successors(i)]

  // if some of the instruction outputs are precise,
  // assume its inputs to be precise
  induced_precise =
    ⎧ state[i].may_read   if (state[i].may_write ∩ precise_after) ≠ ∅
    ⎨
    ⎩ ∅                   otherwise

  state[i].precise_before =
    (precise_after / state[i].must_write) ∩ induced_precise

Where:
- the `may_read` set represents a union of all possibly read slots
  (any slot in `may_read` might be read by the instruction);
- the `must_write` set represents an intersection of all possibly
  written slots (any slot in `must_write` is guaranteed to be written
  by the instruction);
- the `may_write` set represents a union of all possibly written slots
  (any slot in `may_write` might be written by the instruction).
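For illustration, the liveness equation above can be computed as a
fixed point over bitmask sets. The following user-space sketch uses
illustrative names and a fixed-size CFG encoding; it is not the
kernel implementation:

#include <stdbool.h>
#include <stdint.h>

/* One CFG node; bit k of a mask stands for stack slot k. */
struct insn_info {
	uint64_t may_read;   /* slots possibly read by the insn */
	uint64_t must_write; /* slots written on every path through it */
	int nsucc;
	int succ[2];         /* successor instruction indexes */
};

/* live_before[i] = (U_s live_before[succ(i,s)] & ~must_write[i]) | may_read[i],
 * iterated to a fixed point; live_before[] must be zero-initialized.
 * A real implementation would use a worklist over a postorder of the
 * CFG instead of repeated full sweeps. */
static void compute_live_before(const struct insn_info *insn, int n,
				uint64_t *live_before)
{
	bool changed = true;

	while (changed) {
		changed = false;
		for (int i = n - 1; i >= 0; i--) {
			uint64_t live_after = 0;
			uint64_t nv;

			for (int s = 0; s < insn[i].nsucc; s++)
				live_after |= live_before[insn[i].succ[s]];

			nv = (live_after & ~insn[i].must_write) |
			     insn[i].may_read;
			if (nv != live_before[i]) {
				live_before[i] = nv;
				changed = true;
			}
		}
	}
}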
This means that precision tracking can be implemented as a logical extension of liveness tracking: - track registers as well as stack slots; - add bit masks to represent `precise_before` and `may_write`; - add above equations for `precise_before` computation; - (linked registers require some additional consideration). Such extension would allow removal of: - precision propagation logic in verifier.c: - backtrack_insn() - mark_chain_precision() - propagate_{precision,backedges}() - push_jmp_history() and related data structures, which are only used by precision tracking; - add_scc_backedge() and related backedge state accumulation in is_state_visited(), superseded by per-callchain function state accumulated by liveness analysis. The hope here is that unifying liveness and precision tracking will reduce overall amount of code and make it easier to reason about. How this helps with loops? -------------------------- As it stands, this patch set shares the same deficiency as the current liveness tracking mechanism. Liveness marks on stack slots cannot be used to prune states when processing iterator-based loops: - such states still have branches to be explored; - meaning that not all stack slot reads have been discovered. For example: 1: while(iter_next()) { 2: if (...) 3: r0 = *(u64 *)(r10 - 8) 4: if (...) 5: r0 = *(u64 *)(r10 - 16) 6: ... 7: } For any checkpoint state created at instruction (1), it is only possible to rely on read marks for slots fp[-8] and fp[-16] once all child states of (1) have been explored. Thus, when the verifier transitions from (7) to (1), it cannot rely on read marks. However, sacrificing path-sensitivity makes it possible to run analysis defined in this patch set before main verification pass, if estimates for value ranges are available. E.g. for the following program: 1: while(iter_next()) { 2: r0 = r10 3: r0 += r2 4: r0 = *(u64 *)(r2 + 0) 5: ... 6: } If an estimate for `r2` range is available before the main verification pass, it can be used to populate read marks at instruction (4) and run the liveness analysis. Thus making conservative liveness information available during loops verification. Such estimates can be provided by some form of value range analysis. Value range analysis is also necessary to address loop verification from another angle: computing boundaries for loop induction variables and iteration counts. The hope here is that the new liveness tracking mechanism will support the broader goal of making loop verification more efficient. Validation ========== The change was tested on three program sets: - bpf selftests - sched_ext - Meta's internal set of programs Commit [#8] enables a special mode where both the current and new liveness analyses are enabled simultaneously. This mode signals an error if the new algorithm considers a stack slot dead while the current algorithm assumes it is alive. This mode was very useful for debugging. At the time of posting, no such errors have been reported for the above program sets. [#8] "bpf: signal error if old liveness is more conservative than new" Impact on memory consumption ============================ Debug patch [1] extends the kernel and veristat to count the amount of memory allocated for storing analysis data. This patch is not included in the submission. The maximal observed impact for the above program sets is 2.6Mb. Data below is shown in bytes. 
For bpf selftests top 5 consumers look as follows: File Program liveness mem ----------------------- ---------------- ------------ pyperf180.bpf.o on_event 2629740 pyperf600.bpf.o on_event 2287662 pyperf100.bpf.o on_event 1427022 test_verif_scale3.bpf.o balancer_ingress 1121283 pyperf_subprogs.bpf.o on_event 756900 For sched_ext top 5 consumers loog as follows: File Program liveness mem --------- ------------------------------- ------------ bpf.bpf.o lavd_enqueue 164686 bpf.bpf.o lavd_select_cpu 157393 bpf.bpf.o layered_enqueue 154817 bpf.bpf.o lavd_init 127865 bpf.bpf.o layered_dispatch 110129 For Meta's internal set of programs top consumer is 1Mb. [1] kernel-patches/bpf@085588e Impact on verification performance ================================== Veristat results below are reported using `-f insns_pct>1 -f !insns<500` filter and -t option (BPF_F_TEST_STATE_FREQ flag). master vs patch-set, selftests (out of ~4K programs) ---------------------------------------------------- File Program Insns (A) Insns (B) Insns (DIFF) -------------------------------- -------------------------------------- --------- --------- --------------- cpumask_success.bpf.o test_global_mask_nested_deep_array_rcu 1622 1655 +33 (+2.03%) strobemeta_bpf_loop.bpf.o on_event 2163 2684 +521 (+24.09%) test_cls_redirect.bpf.o cls_redirect 36001 42515 +6514 (+18.09%) test_cls_redirect_dynptr.bpf.o cls_redirect 2299 2339 +40 (+1.74%) test_cls_redirect_subprogs.bpf.o cls_redirect 69545 78497 +8952 (+12.87%) test_l4lb_noinline.bpf.o balancer_ingress 2993 3084 +91 (+3.04%) test_xdp_noinline.bpf.o balancer_ingress_v4 3539 3616 +77 (+2.18%) test_xdp_noinline.bpf.o balancer_ingress_v6 3608 3685 +77 (+2.13%) master vs patch-set, sched_ext (out of 148 programs) ---------------------------------------------------- File Program Insns (A) Insns (B) Insns (DIFF) --------- ---------------- --------- --------- --------------- bpf.bpf.o chaos_dispatch 2257 2287 +30 (+1.33%) bpf.bpf.o lavd_enqueue 20735 22101 +1366 (+6.59%) bpf.bpf.o lavd_select_cpu 22100 24409 +2309 (+10.45%) bpf.bpf.o layered_dispatch 25051 25606 +555 (+2.22%) bpf.bpf.o p2dq_dispatch 961 990 +29 (+3.02%) bpf.bpf.o rusty_quiescent 526 534 +8 (+1.52%) bpf.bpf.o rusty_runnable 541 547 +6 (+1.11%) Perf report =========== In relative terms, the analysis does not consume much CPU time. For example, here is a perf report collected for pyperf180 selftest: # Children Self Command Shared Object Symbol # ........ ........ ........ .................... ........................................ ... 1.22% 1.22% veristat [kernel.kallsyms] [k] bpf_update_live_stack ... Changelog ========= v1: https://lore.kernel.org/bpf/[email protected]/T/ v1 -> v2: - compute_postorder() fixed to handle jumps with offset -1 (syzbot). - is_state_visited() in patch #9 fixed access to uninitialized `err` (kernel test robot, Dan Carpenter). - Selftests added. - Fixed bug with write marks propagation from callee to caller, see verifier_live_stack.c:caller_stack_write() test case. - Added a patch for __not_msg() annotation for test_loader based tests. v2: https://lore.kernel.org/bpf/20250918-callchain-sensitive-liveness-v2-0-214ed2653eee@gmail.com/ v2 -> v3: - Added __diag_ignore_all("-Woverride-init", ...) in liveness.c for bpf_insn_successors() (suggested by Alexei). Signed-off-by: Eduard Zingerman <[email protected]> ==================== Link: https://patch.msgid.link/20250918-callchain-sensitive-liveness-v3-0-c3cd27bacc60@gmail.com Signed-off-by: Alexei Starovoitov <[email protected]>
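To make the liveness equations above concrete, here is a minimal
user-space sketch, not the verifier's implementation: the instruction
encoding, the struct and function names, and the toy CFG are all
invented for illustration. It iterates live_before to a fixed point
over a bitmask-per-instruction encoding of the cover letter's first
example program:

  #include <stdint.h>
  #include <stdio.h>

  #define MAX_SUCC 2

  struct insn_info {
          uint64_t may_read;    /* stack slots possibly read */
          uint64_t must_write;  /* stack slots certainly written */
          int succ[MAX_SUCC];   /* successor insn indices, -1 = none */
  };

  /* live_before[i] = (U live_before[s] for s in succ(i))
   *                  / must_write[i]  U  may_read[i]
   * Sweep backwards and repeat until nothing changes (fixed point);
   * the outer loop handles back edges in the CFG. */
  static void compute_liveness(const struct insn_info *prog, int n,
                               uint64_t *live_before)
  {
          int changed = 1;

          while (changed) {
                  changed = 0;
                  for (int i = n - 1; i >= 0; i--) {
                          uint64_t live_after = 0;

                          for (int k = 0; k < MAX_SUCC; k++)
                                  if (prog[i].succ[k] >= 0)
                                          live_after |= live_before[prog[i].succ[k]];

                          uint64_t nb = (live_after & ~prog[i].must_write)
                                        | prog[i].may_read;
                          if (nb != live_before[i]) {
                                  live_before[i] = nb;
                                  changed = 1;
                          }
                  }
          }
  }

  int main(void)
  {
          /* Bit 0 stands for stack slot fp[-8]; indices are 0-based
           * versions of insns 1..6 from the example. */
          const struct insn_info prog[6] = {
                  { 0, 0,         { 1, -1 } },  /* 1: call            */
                  { 0, 1ull << 0, { 2, -1 } },  /* 2: write fp[-8]    */
                  { 0, 0,         { 3,  4 } },  /* 3: cond jump       */
                  { 0, 0,         { -1, -1 } }, /* 4: exit            */
                  { 1ull << 0, 0, { 5, -1 } },  /* 5: read fp[-8]     */
                  { 0, 0,         { -1, -1 } }, /* 6: exit            */
          };
          uint64_t live_before[6] = { 0 };

          compute_liveness(prog, 6, live_before);
          for (int i = 0; i < 6; i++)
                  printf("insn %d: live_before=%#llx\n", i + 1,
                         (unsigned long long)live_before[i]);
          return 0;
  }

Running this prints a live_before mask of 0x1 for instructions 3 and 5
and 0 everywhere else, matching the prose: the read mark for fp[-8]
propagates up to instruction 3 and is stopped at instruction 2 by the
must-write mark.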
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Sep 24, 2025
Ido Schimmel says:

====================
ipv4: icmp: Fix source IP derivation in presence of VRFs

Align IPv4 with IPv6: in the presence of VRFs, generate ICMP error
messages with a source IP that is derived from the receiving interface
and not from its VRF master. This is especially important for "Time
Exceeded" messages, where sourcing from the VRF master means that
utilities like traceroute will show an incorrect packet path.

Patches #1-#2 are preparations. Patch #3 is the actual change.

Patches #4-#7 make small improvements in the existing traceroute test.
Patch #8 extends the traceroute test with VRF test cases for both IPv4
and IPv6.

Changes since v1 [1]:
* Rebase.

[1] https://lore.kernel.org/netdev/[email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
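The rule being changed can be summed up in a deliberately simplified
user-space model. Everything here is an assumption made for
illustration: the struct, field names, and the boolean knob are
invented, and the real logic lives in the kernel's ICMP/FIB code
paths.

  #include <stdint.h>
  #include <stdio.h>

  struct netif {
          const char *name;
          uint32_t addr;                 /* IPv4 address, host byte order */
          const struct netif *l3_master; /* VRF master device, or NULL */
  };

  /* Old behavior: a VRF-enslaved ingress interface sources the ICMP
   * error from an address of its VRF master. Fixed behavior: always
   * derive the source from the ingress interface itself, so that
   * traceroute reports the real hop. */
  static uint32_t icmp_err_saddr(const struct netif *ingress, int fixed)
  {
          if (!fixed && ingress->l3_master)
                  return ingress->l3_master->addr;
          return ingress->addr;
  }

  int main(void)
  {
          const struct netif vrf = { "vrf-red", 0x0a000001, NULL };  /* 10.0.0.1    */
          const struct netif eth = { "eth0", 0xc0a80101, &vrf };     /* 192.168.1.1 */

          printf("old: %#x, fixed: %#x\n",
                 icmp_err_saddr(&eth, 0), icmp_err_saddr(&eth, 1));
          return 0;
  }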
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Sep 24, 2025
Petr Machata says:

====================
bridge: Allow keeping local FDB entries only on VLAN 0

The bridge FDB contains one local entry per port per VLAN, for the MAC
of the port in question, and likewise for the bridge itself. This
allows the bridge to locally receive and punt "up" any packets whose
destination MAC address matches that of one of the bridge interfaces
or of the bridge itself.

The number of these local "service" FDB entries grows linearly with
the number of bridge-global VLAN memberships, but that in turn will
tend to grow quadratically with the number of ports and per-port VLAN
memberships. While that does not cause issues during forwarding
lookups, it does make dumps impractically slow.

As an example, with 100 interfaces, each on 4K VLANs, a full dump of
an FDB that just contains these 400K local entries takes 6.5 s. That
is _without_ considering iproute2 formatting overhead; this is just
how long it takes to walk the FDB (repeatedly), serialize it into
netlink messages, and parse the messages back in userspace. This is to
illustrate that with a growing number of ports and VLANs, the time
required to dump this repetitive information blows up. Arguably 4K
VLANs per interface is not a very realistic configuration, but then
modern switches can instead have several hundred interfaces, and we
have fielded requests for >1K VLAN memberships per port among
customers.

FDB entries are currently all kept on a single linked list, and
dumping walks this linked list to dump all entries in order. When the
message buffer is full, the iteration is cut short and later
restarted. Of course, to restart the iteration, it is first necessary
to walk the already-dumped front part of the list before dumping
again. So one possibility is to organize the FDB entries in a
different structure more amenable to walk restarts.

One option is to walk the hash table directly. The advantage is that
no auxiliary structure needs to be introduced. With a rough sketch of
this approach, the above scenario gets dumped in not quite 3 s, saving
over 50 % of the time. However, hash table iteration requires
maintaining an active cursor that must be collected when the dump is
aborted. It looks like that would require changes in the NDO protocol
to allow this cleanup to run. Moreover, on hash table resize the
iteration is simply restarted. FDB dumps are currently not guaranteed
to correspond to any one particular state: entries can be missed, or
be duplicated. But with hash table iteration we would get that plus
the much less graceful resize behavior, where swaths of the FDB are
duplicated.

Another option is to maintain the FDB entries in a red-black tree. We
have a PoC of this approach on hand, and the above scenario is dumped
in about 2.5 s. Still not as snappy as we'd like, but better than the
hash table. However, the savings come at the expense of a more
expensive insertion, and require locking during dumps, which blocks
insertion.

The upside of these approaches is that they provide benefits whatever
the FDB contents. But it does not seem like either of these is
workable. However, we intend to clean up the RB tree PoC and present
it for consideration later on, in case the trade-offs are considered
acceptable.

Yet another option might be to use in-kernel FDB filtering, and to
filter out the local entries when dumping. Unfortunately, this does
not help all that much either, because the linked-list walk still
needs to happen.

Also, with the obvious filtering interface built around ndm_flags /
ndm_state filtering, one cannot just exclude pure local entries in one
query. One needs to dump all non-local entries first, and then get the
permanent entries in another run filtered to local & added_by_user.
I.e. one needs to pay the iteration overhead twice, and then integrate
the results in userspace. To get significant savings, one would need a
very specific knob like "dump, but skip/only include local entries".
But if we are adding local-specific knobs, maybe let's have an option
to just not duplicate them in the first place.

All this FDB duplication is there merely to make things snappy during
forwarding. But high-radix switches with thousands of VLANs typically
do not process much traffic in the SW datapath at all, but rather
offload the vast majority of it. So we could exchange some runtime
performance for a neater FDB.

To that end, this patchset introduces a new bridge option,
BR_BOOLOPT_FDB_LOCAL_VLAN_0, which, when enabled, has local FDB
entries installed only on VLAN 0, instead of duplicating them across
all VLANs. Then, to maintain the local termination behavior, on FDB
miss the bridge does a second lookup on VLAN 0.

Enabling this option changes the bridge behavior in expected ways.
Since the entries are only kept on VLAN 0, FDB get, flush and dump
will not perceive them on non-0 VLANs. And deleting the VLAN 0 entry
affects forwarding on all VLANs.

This patchset is loosely based on a privately circulated patch by
Nikolay Aleksandrov.

The patchset progresses as follows:

- Patch #1 introduces a bridge option to enable the above feature.
  Then patches #2 to #5 gradually patch the bridge to do the right
  thing when the option is enabled. Finally, patch #6 adds the UAPI
  knob and the code for when the feature is enabled or disabled.
- Patches #7, #8 and #9 contain fixes and improvements to selftest
  libraries.
- Patch #10 contains a new selftest.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
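The miss-then-VLAN-0 fallback described above is easy to model. The
following is a toy sketch under stated assumptions: the flat-array
table, the struct, and the function names are invented for
illustration (the real bridge keeps its FDB in an rhashtable), and
only the lookup order is meant to match the cover letter.

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  struct fdb_entry {
          uint8_t  mac[6];
          uint16_t vid;
          bool     is_local;
  };

  /* Exact-match lookup on (mac, vid) over a toy flat-array FDB. */
  static const struct fdb_entry *fdb_find(const struct fdb_entry *tbl,
                                          int n, const uint8_t *mac,
                                          uint16_t vid)
  {
          for (int i = 0; i < n; i++)
                  if (tbl[i].vid == vid && !memcmp(tbl[i].mac, mac, 6))
                          return &tbl[i];
          return NULL;
  }

  /* With the FDB_LOCAL_VLAN_0 option enabled, a miss on the packet's
   * VLAN falls back to a second lookup on VLAN 0, where the single
   * local entry per MAC now lives. */
  static const struct fdb_entry *fdb_lookup(const struct fdb_entry *tbl,
                                            int n, const uint8_t *mac,
                                            uint16_t vid, bool local_vlan_0)
  {
          const struct fdb_entry *e = fdb_find(tbl, n, mac, vid);

          if (!e && local_vlan_0)
                  e = fdb_find(tbl, n, mac, 0);
          return e;
  }

  int main(void)
  {
          /* One local entry for the port MAC, kept on VLAN 0 only. */
          const struct fdb_entry tbl[] = {
                  { { 0x02, 0, 0, 0, 0, 0x01 }, 0, true },
          };
          const uint8_t mac[6] = { 0x02, 0, 0, 0, 0, 0x01 };

          /* A packet arriving on VLAN 10 misses, then hits on VLAN 0. */
          const struct fdb_entry *e = fdb_lookup(tbl, 1, mac, 10, true);
          printf("hit: %s\n", e ? "yes" : "no");
          return 0;
  }

The design trade-off is visible even in the toy: the fallback costs a
second lookup on the miss path, which is acceptable when most traffic
is offloaded and the SW datapath is cold.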
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Oct 4, 2025
Petr Machata says:

====================
selftests: Mark auto-deferring functions clearly

selftests/net/lib.sh contains a suite of iproute2 wrappers that
automatically schedule the corresponding cleanup through defer. The
fact that they do so is however not immediately obvious; one needs to
know which functions handle the deferral behind the scenes, and which
expect the caller to handle cleanups themselves. A convention for
these auto-deferring functions would help both writing and patch
review.

This patchset establishes one by marking these functions with an adf_
prefix. We already have a few such functions: forwarding/lib.sh has
adf_mcd_start(), and a few selftests add private helpers that conform
to this convention.

Patches #1 to #8 gradually convert individual functions, one per
patch.

Patch #9 renames an auto-deferring private helper named dfr_* to
adf_*. The plan is not to retro-rename all private helpers, but I
happened to know about this one.

Patches #10 to #12 introduce several autodefer helpers for commonly
used forwarding/lib.sh functions, and opportunistically convert
straightforward instances of 'action; defer counteraction' to the new
helpers.

Patch #13 adds some README verbiage to pitch defer and the adf_*
convention.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot
pushed a commit
that referenced
this pull request
Oct 12, 2025
Before disabling SR-IOV via config space accesses to the parent PF,
sriov_disable() first removes the PCI devices representing the VFs.
Since commit 9d16947 ("PCI: Add global pci_lock_rescan_remove()") such
removal operations are serialized against concurrent remove and rescan
using the pci_rescan_remove_lock. No such locking was ever added in
sriov_disable(), however. In particular, when commit 18f9e9d
("PCI/IOV: Factor out sriov_add_vfs()") factored out the PCI device
removal into sriov_del_vfs(), there was still no locking around the
pci_iov_remove_virtfn() calls.

On s390 the lack of serialization in sriov_disable() may cause double
remove and list corruption, with the below (amended) trace being
observed:

  PSW:  0704c00180000000 0000000c914e4b38 (klist_put+56)
  GPRS: 000003800313fb48 0000000000000000 0000000100000001 0000000000000001
        00000000f9b520a8 0000000000000000 0000000000002fbd 00000000f4cc9480
        0000000000000001 0000000000000000 0000000000000000 0000000180692828
        00000000818e8000 000003800313fe2c 000003800313fb20 000003800313fad8
   #0 [3800313fb20] device_del at c9158ad5c
   #1 [3800313fb88] pci_remove_bus_device at c915105ba
   #2 [3800313fbd0] pci_iov_remove_virtfn at c9152f198
   #3 [3800313fc28] zpci_iov_remove_virtfn at c90fb67c0
   #4 [3800313fc60] zpci_bus_remove_device at c90fb6104
   #5 [3800313fca0] __zpci_event_availability at c90fb3dca
   #6 [3800313fd08] chsc_process_sei_nt0 at c918fe4a2
   #7 [3800313fd60] crw_collect_info at c91905822
   #8 [3800313fe10] kthread at c90feb390
   #9 [3800313fe68] __ret_from_fork at c90f6aa64
  #10 [3800313fe98] ret_from_fork at c9194f3f2

This is because, in addition to sriov_disable() removing the VFs, the
platform also generates hot-unplug events for the VFs. These are the
reverse of the hotplug events generated by sriov_enable() and are
handled via pdev->no_vf_scan. And while the event processing takes
pci_rescan_remove_lock and checks whether the struct pci_dev still
exists, the lack of synchronization makes this check racy.

Other races may also be possible, of course, though given that this
lack of locking persisted so long, observable races seem very rare.
Even on s390 the list corruption was only observed with certain
devices, since the platform events are only triggered by config
accesses after the removal; as long as the removal finished
synchronously they would not race. Either way the locking is missing,
so fix this by adding it to the sriov_del_vfs() helper.

Just like the PCI rescan-remove locking above, locking is also missing
in sriov_add_vfs(), including for the error case where
pci_stop_and_remove_bus_device() is called without the PCI
rescan-remove lock being held. Even in the non-error case, adding new
PCI devices and buses should be serialized via the PCI rescan-remove
lock. Add the necessary locking.

Fixes: 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()")
Signed-off-by: Niklas Schnelle <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Benjamin Block <[email protected]>
Reviewed-by: Farhan Ali <[email protected]>
Reviewed-by: Julian Ruess <[email protected]>
Cc: [email protected]
Link: https://patch.msgid.link/[email protected]
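For readers skimming the commit message, the shape of the fix is
roughly the following sketch. It is not the verbatim upstream patch:
the body of sriov_del_vfs() and the iov->num_VFs bookkeeping are
assumptions based on the functions named above, and the real change
also covers the sriov_add_vfs() paths.

  /* Serialize VF removal against concurrent remove/rescan, including
   * the platform hot-unplug events seen in the trace above. */
  static void sriov_del_vfs(struct pci_dev *dev)
  {
          struct pci_sriov *iov = dev->sriov;
          int i;

          pci_lock_rescan_remove();
          for (i = 0; i < iov->num_VFs; i++)
                  pci_iov_remove_virtfn(dev, i);
          pci_unlock_rescan_remove();
  }

With the lock held across the loop, the racy "does the struct pci_dev
still exist" check in the event-processing path can no longer observe
a half-removed VF.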
Pull request for series with
subject: bpf: reject kfunc calls that overflow insn->imm
version: 4
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=614391