Skip to content

Conversation

@oscarvei
Copy link

@oscarvei oscarvei commented Oct 1, 2012

No description provided.

heftig referenced this pull request in zen-kernel/zen-kernel Oct 1, 2012
…d reasons

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    #10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    #11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    #12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    #13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    #14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    #15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    #16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    #17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    #18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    #19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    #20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    #21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    #22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    #23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    #24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    #25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Cc: [email protected]
ckenna pushed a commit to LITMUS-RT/litmus-rt-odroidx that referenced this pull request Oct 1, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     torvalds#6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     torvalds#7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     torvalds#8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     torvalds#9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
koenkooi pushed a commit to koenkooi/linux that referenced this pull request Oct 2, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Ben Hutchings <[email protected]>
mkrufky pushed a commit to mkrufky/linux that referenced this pull request Oct 2, 2012
The warning below triggers on AMD MCM packages because physical package
IDs on the cores of a _physical_ socket are the same. I.e., this field
says which CPUs belong to the same physical package.

However, the same two CPUs belong to two different internal, i.e.
"logical" nodes in the same physical socket which is reflected in the
CPU-to-node map on x86 with NUMA.

Which makes this check wrong on the above topologies so circumvent it.

[    0.444413] Booting Node   0, Processors  #1 #2 #3 #4 #5 Ok.
[    0.461388] ------------[ cut here ]------------
[    0.465997] WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()
[    0.473960] Hardware name: Dinar
[    0.477170] sched: CPU torvalds#6's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[    0.486860] Booting Node   1, Processors  torvalds#6
[    0.491104] Modules linked in:
[    0.494141] Pid: 0, comm: swapper/6 Not tainted 3.4.0+ #1
[    0.499510] Call Trace:
[    0.501946]  [<ffffffff8144bf92>] ? topology_sane.clone.1+0x6e/0x81
[    0.508185]  [<ffffffff8102f1fc>] warn_slowpath_common+0x85/0x9d
[    0.514163]  [<ffffffff8102f2b7>] warn_slowpath_fmt+0x46/0x48
[    0.519881]  [<ffffffff8144bf92>] topology_sane.clone.1+0x6e/0x81
[    0.525943]  [<ffffffff8144c234>] set_cpu_sibling_map+0x251/0x371
[    0.532004]  [<ffffffff8144c4ee>] start_secondary+0x19a/0x218
[    0.537729] ---[ end trace 4eaa2a86a8e2da22 ]---
[    0.628197]  torvalds#7 torvalds#8 torvalds#9 torvalds#10 torvalds#11 Ok.
[    0.807108] Booting Node   3, Processors  torvalds#12 torvalds#13 torvalds#14 torvalds#15 torvalds#16 torvalds#17 Ok.
[    0.897587] Booting Node   2, Processors  torvalds#18 torvalds#19 torvalds#20 torvalds#21 torvalds#22 torvalds#23 Ok.
[    0.917443] Brought up 24 CPUs

We ran a topology sanity check test we have here on it and
it all looks ok... hopefully :).

Signed-off-by: Borislav Petkov <[email protected]>
Cc: Andreas Herrmann <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
koenkooi pushed a commit to koenkooi/linux that referenced this pull request Oct 4, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Ben Hutchings <[email protected]>
fabio-porcedda pushed a commit to fabio-porcedda/linux-ge863-pro3 that referenced this pull request Oct 12, 2012
Following commit a79dd5a titled "tty/serial/pmac_zilog: Fix suspend & resume",
my Powerbook G4 Titanium showed the following stack dump:

[   36.878225] irq 23: nobody cared (try booting with the "irqpoll" option)
[   36.878251] Call Trace:
[   36.878291] [dfff3f00] [c000984c] show_stack+0x7c/0x194 (unreliable)
[   36.878322] [dfff3f40] [c00a6868] __report_bad_irq+0x44/0xf4
[   36.878339] [dfff3f60] [c00a6b0] note_interrupt+0x1ec/0x2ac
[   36.878356] [dfff3f80] [c00a48d0] handle_irq_event_percpu+0x250/0x2b8
[   36.878372] [dfff3fd0] [c00a496c] handle_irq_event+0x34/0x54
[   36.878389] [dfff3fe0] [c00a753c] handle_fasteoi_irq+0xb4/0x124
[   36.878412] [dfff3ff0] [c000f5bc] call_handle_irq+0x18/0x28
[   36.878428] [deef1f10] [c000719c] do_IRQ+0x114/0x1cc
[   36.878446] [deef1f40] [c0015868] ret_from_except+0x0/0x1c
[   36.878484] --- Exception: 501 at 0xf497610
[   36.878489]     LR = 0xfdc3dd0
[   36.878497] handlers:
[   36.878510] [<c02b7424>] pmz_interrupt
[   36.878520] Disabling IRQ torvalds#23

From an E-mail exchange about this problem, Andreas Schwab noticed a typo
that resulted in the wrong condition being tested.

The patch also corrects 2 typos that incorrectly report why an error branch
is being taken.

Signed-off-by: Larry Finger <[email protected]>
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
noamc referenced this pull request in Mellanox/linux Oct 16, 2012
…fpga] IPI code comments (cosmetic)

Signed-off-by: Vineet Gupta <[email protected]>
koenkooi pushed a commit to koenkooi/linux that referenced this pull request Oct 17, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Ben Hutchings <[email protected]>
hknkkn pushed a commit to hknkkn/linux-dynticks that referenced this pull request Oct 29, 2012
Printing the "start_ip" for every secondary cpu is very noisy on a large
system - and doesn't add any value. Drop this message.

Console log before:
Booting Node   0, Processors  #1
smpboot cpu 1: start_ip = 96000
 #2
smpboot cpu 2: start_ip = 96000
 #3
smpboot cpu 3: start_ip = 96000
 #4
smpboot cpu 4: start_ip = 96000
       ...
 torvalds#31
smpboot cpu 31: start_ip = 96000
Brought up 32 CPUs

Console log after:
Booting Node   0, Processors  #1 #2 #3 #4 #5 torvalds#6 torvalds#7 Ok.
Booting Node   1, Processors  torvalds#8 torvalds#9 torvalds#10 torvalds#11 torvalds#12 torvalds#13 torvalds#14 torvalds#15 Ok.
Booting Node   0, Processors  torvalds#16 torvalds#17 torvalds#18 torvalds#19 torvalds#20 torvalds#21 torvalds#22 torvalds#23 Ok.
Booting Node   1, Processors  torvalds#24 torvalds#25 torvalds#26 torvalds#27 torvalds#28 torvalds#29 torvalds#30 torvalds#31
Brought up 32 CPUs

Acked-by: Borislav Petkov <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
hknkkn pushed a commit to hknkkn/linux-dynticks that referenced this pull request Oct 29, 2012
This patch converts the remaining struct se_portal_group->session_lock
usage to use irqsave+irqrestore to address the following warnings for
hardware target mode interrupt context usage.  This change generate
other warnings for current iscsi-target mode still using ->session_lock
with spin_lock_bh, which will need to be converted in a seperate patch.

[  492.480728] [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
[  492.488194] 3.0.0+ torvalds#23
[  492.490820] ------------------------------------------------------
[  492.497704] sh/7162 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
[  492.504493] (&(&se_tpg->session_lock)->rlock){+.....}, at: [<ffffffffa022364d>] transport_deregister_session+0x2d/0x163 [target_core_mod]
  492.518390]
[  492.518390] and this task is already holding:
[  492.524897] (&(&ha->hardware_lock)->rlock){-.-...}, at: [<ffffffffa00b9146>] qla_tgt_stop_phase1+0x5e/0x27e [qla2xxx]
[  492.536856] which would create a new lock dependency:
[  492.542481] (&(&ha->hardware_lock)->rlock){-.-...} -> (&(&se_tpg->session_lock)->rlock){+.....}
[  492.552321]
[  492.552321] but this new dependency connects a HARDIRQ-irq-safe lock:
[  492.561149] (&(&ha->hardware_lock)->rlock){-.-...}
[  492.566400] ... which became HARDIRQ-irq-safe at:
[  492.571841]   [<ffffffff81064720>] __lock_acquire+0x68f/0x921
[  492.578247]   [<ffffffff81064eff>] lock_acquire+0xe0/0x10d
[  492.584367]   [<ffffffff813a74c6>] _raw_spin_lock_irqsave+0x44/0x56
[  492.591358]   [<ffffffffa009b1be>] qla24xx_msix_default+0x5c/0x2aa [qla2xxx]
[  492.599227]   [<ffffffff81088582>] handle_irq_event_percpu+0x5a/0x197
[  492.606413]   [<ffffffff810886fb>] handle_irq_event+0x3c/0x5c
[  492.612822]   [<ffffffff8108a6dc>] handle_edge_irq+0xcc/0xf1
[  492.619138]   [<ffffffff810039b9>] handle_irq+0x83/0x8e
[  492.624971]   [<ffffffff8100333e>] do_IRQ+0x48/0xaf
[  492.630413]   [<ffffffff813a7cd3>] ret_from_intr+0x0/0x1a
[  492.636437]   [<ffffffff81001dc1>] cpu_idle+0x5b/0x8d
[  492.642073]   [<ffffffff81392709>] rest_init+0xad/0xb4
[  492.647809]   [<ffffffff81a1cbbc>] start_kernel+0x366/0x371
[  492.654030]   [<ffffffff81a1c2b1>] x86_64_start_reservations+0xb8/0xbc
[  492.661311]   [<ffffffff81a1c3b6>] x86_64_start_kernel+0x101/0x110
[  492.668204]
[  492.668205] to a HARDIRQ-irq-unsafe lock:
[  492.674324] (&(&se_tpg->session_lock)->rlock){+.....}
[  492.679862] ... which became HARDIRQ-irq-unsafe at:
[  492.685497] ...  [<ffffffff8106479a>] __lock_acquire+0x709/0x921
[  492.692209]   [<ffffffff81064eff>] lock_acquire+0xe0/0x10d
[  492.698330]   [<ffffffff813a75ed>] _raw_spin_lock_bh+0x31/0x40
[  492.704836]   [<ffffffffa021c208>] core_tpg_del_initiator_node_acl+0x89/0x336 [target_core_mod]
[  492.714546]   [<ffffffffa02fb075>] tcm_qla2xxx_drop_nodeacl+0x20/0x2d [tcm_qla2xxx]
[  492.723087]   [<ffffffffa02108d9>] target_fabric_nacl_base_release+0x22/0x24 [target_core_mod]
[  492.732698]   [<ffffffffa01661c8>] config_item_release+0x7d/0xa3 [configfs]
[  492.740465]   [<ffffffff811d48fe>] kref_put+0x43/0x4d
[  492.746101]   [<ffffffffa0166149>] config_item_put+0x19/0x1b [configfs]
[  492.753481]   [<ffffffffa0164987>] configfs_rmdir+0x1eb/0x258 [configfs]
[  492.760957]   [<ffffffff810ecc54>] vfs_rmdir+0x79/0xd0
[  492.766690]   [<ffffffff810eec4a>] do_rmdir+0xc2/0x111
[  492.772423]   [<ffffffff810eecd0>] sys_rmdir+0x11/0x13
[  492.778156]   [<ffffffff813ae4d2>] system_call_fastpath+0x16/0x1b
[  492.784953]

Signed-off-by: Nicholas Bellinger <[email protected]>
koenkooi pushed a commit to koenkooi/linux that referenced this pull request Oct 31, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Ben Hutchings <[email protected]>
vineetgarc referenced this pull request in foss-for-synopsys-dwc-arc-processors/linux Oct 31, 2012
koenkooi pushed a commit to koenkooi/linux that referenced this pull request Nov 14, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Ben Hutchings <[email protected]>
kees pushed a commit to kees/linux that referenced this pull request Nov 16, 2012
…d reasons

BugLink: http://bugs.launchpad.net/bugs/1035435

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     torvalds#6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     torvalds#7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     torvalds#8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     torvalds#9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Ben Hutchings <[email protected]>
Signed-off-by: Herton Ronaldo Krzesinski <[email protected]>
koenkooi pushed a commit to koenkooi/linux that referenced this pull request Nov 21, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Ben Hutchings <[email protected]>
vineetgarc referenced this pull request in foss-for-synopsys-dwc-arc-processors/linux Dec 31, 2012
cianmcgovern pushed a commit to cianmcgovern/linux that referenced this pull request Mar 10, 2013
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     torvalds#6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     torvalds#7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     torvalds#8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     torvalds#9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Ben Hutchings <[email protected]>
tom3q pushed a commit to tom3q/linux that referenced this pull request Apr 28, 2013
…failed

This bug could be triggered if 1st interface configuration fails:
Apr  8 18:20:30 homeserver kernel: usb 5-1: new low-speed USB device number 2 using ohci_hcd
Apr  8 18:20:30 homeserver kernel: input: iMON Panel, Knob and Mouse(15c2:0036) as /devices/pci0000:00/0000:00:13.0/usb5/5-1/5-1:1.0/input/input2
Apr  8 18:20:30 homeserver kernel: Registered IR keymap rc-imon-pad
Apr  8 18:20:30 homeserver kernel: input: iMON Remote (15c2:0036) as /devices/pci0000:00/0000:00:13.0/usb5/5-1/5-1:1.0/rc/rc0/input3
Apr  8 18:20:30 homeserver kernel: rc0: iMON Remote (15c2:0036) as /devices/pci0000:00/0000:00:13.0/usb5/5-1/5-1:1.0/rc/rc0
Apr  8 18:20:30 homeserver kernel: imon:send_packet: packet tx failed (-32)
Apr  8 18:20:30 homeserver kernel: imon 5-1:1.0: remote input dev register failed
Apr  8 18:20:30 homeserver kernel: imon 5-1:1.0: imon_init_intf0: rc device setup failed
Apr  8 18:20:30 homeserver kernel: imon 5-1:1.0: unable to initialize intf0, err 0
Apr  8 18:20:30 homeserver kernel: imon:imon_probe: failed to initialize context!
Apr  8 18:20:30 homeserver kernel: imon 5-1:1.0: unable to register, err -19
Apr  8 18:20:30 homeserver kernel: BUG: unable to handle kernel NULL pointer dereference at 00000014
Apr  8 18:20:30 homeserver kernel: IP: [<c05c4e4c>] mutex_lock+0xc/0x30
Apr  8 18:20:30 homeserver kernel: *pde = 00000000
Apr  8 18:20:30 homeserver kernel: Oops: 0002 [#1] PREEMPT SMP
Apr  8 18:20:30 homeserver kernel: Modules linked in:
Apr  8 18:20:30 homeserver kernel: Pid: 367, comm: khubd Not tainted 3.8.3-htpc-00002-g79b1403 torvalds#23 Unknow Unknow/RS780-SB700
Apr  8 18:20:30 homeserver kernel: EIP: 0060:[<c05c4e4c>] EFLAGS: 00010296 CPU: 1
Apr  8 18:20:30 homeserver kernel: EIP is at mutex_lock+0xc/0x30
Apr  8 18:20:30 homeserver kernel: EAX: 00000014 EBX: 00000014 ECX: 00000000 EDX: f590e480
Apr  8 18:20:30 homeserver kernel: ESI: f5deac00 EDI: f590e480 EBP: f5f3ee00 ESP: f6577c28
Apr  8 18:20:30 homeserver kernel: DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Apr  8 18:20:30 homeserver kernel: CR0: 8005003b CR2: 00000014 CR3: 0081b000 CR4: 000007d0
Apr  8 18:20:30 homeserver kernel: DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Apr  8 18:20:30 homeserver kernel: DR6: ffff0ff0 DR7: 00000400
Apr  8 18:20:30 homeserver kernel: Process khubd (pid: 367, ti=f6576000 task=f649ea00 task.ti=f6576000)
Apr  8 18:20:30 homeserver kernel: Stack:
Apr  8 18:20:30 homeserver kernel: 00000000 f5deac00 c0448de4 f59714c0 f5deac64 c03b8ad2 f6577c90 00000004
Apr  8 18:20:30 homeserver kernel: f649ea00 c0205142 f6779820 a1ff7f08 f5deac00 00000001 f5f3ee1c 00000014
Apr  8 18:20:30 homeserver kernel: 00000004 00000202 15c20036 c07a03e8 fffee0ca f6795c00 f5f3ee1c f5deac00
Apr  8 18:20:30 homeserver kernel: Call Trace:
Apr  8 18:20:30 homeserver kernel: [<c0448de4>] ? imon_probe+0x494/0xde0
Apr  8 18:20:30 homeserver kernel: [<c03b8ad2>] ? rpm_resume+0xb2/0x4f0
Apr  8 18:20:30 homeserver kernel: [<c0205142>] ? sysfs_addrm_finish+0x12/0x90
Apr  8 18:20:30 homeserver kernel: [<c04170e9>] ? usb_probe_interface+0x169/0x240
Apr  8 18:20:30 homeserver kernel: [<c03b0ca0>] ? __driver_attach+0x80/0x80
Apr  8 18:20:30 homeserver kernel: [<c03b0ca0>] ? __driver_attach+0x80/0x80
Apr  8 18:20:30 homeserver kernel: [<c03b0a94>] ? driver_probe_device+0x54/0x1e0
Apr  8 18:20:30 homeserver kernel: [<c0416abe>] ? usb_device_match+0x4e/0x80
Apr  8 18:20:30 homeserver kernel: [<c03af314>] ? bus_for_each_drv+0x34/0x70
Apr  8 18:20:30 homeserver kernel: [<c03b0a0b>] ? device_attach+0x7b/0x90
Apr  8 18:20:30 homeserver kernel: [<c03b0ca0>] ? __driver_attach+0x80/0x80
Apr  8 18:20:30 homeserver kernel: [<c03b00ff>] ? bus_probe_device+0x5f/0x80
Apr  8 18:20:30 homeserver kernel: [<c03aeab7>] ? device_add+0x567/0x610
Apr  8 18:20:30 homeserver kernel: [<c041a7bc>] ? usb_create_ep_devs+0x7c/0xd0
Apr  8 18:20:30 homeserver kernel: [<c0413837>] ? create_intf_ep_devs+0x47/0x70
Apr  8 18:20:30 homeserver kernel: [<c04156c4>] ? usb_set_configuration+0x454/0x750
Apr  8 18:20:30 homeserver kernel: [<c03b0ca0>] ? __driver_attach+0x80/0x80
Apr  8 18:20:30 homeserver kernel: [<c041de8a>] ? generic_probe+0x2a/0x80
Apr  8 18:20:30 homeserver kernel: [<c03b0ca0>] ? __driver_attach+0x80/0x80
Apr  8 18:20:30 homeserver kernel: [<c0205aff>] ? sysfs_create_link+0xf/0x20
Apr  8 18:20:30 homeserver kernel: [<c04171db>] ? usb_probe_device+0x1b/0x40
Apr  8 18:20:30 homeserver kernel: [<c03b0a94>] ? driver_probe_device+0x54/0x1e0
Apr  8 18:20:30 homeserver kernel: [<c03af314>] ? bus_for_each_drv+0x34/0x70
Apr  8 18:20:30 homeserver kernel: [<c03b0a0b>] ? device_attach+0x7b/0x90
Apr  8 18:20:30 homeserver kernel: [<c03b0ca0>] ? __driver_attach+0x80/0x80
Apr  8 18:20:30 homeserver kernel: [<c03b00ff>] ? bus_probe_device+0x5f/0x80
Apr  8 18:20:30 homeserver kernel: [<c03aeab7>] ? device_add+0x567/0x610
Apr  8 18:20:30 homeserver kernel: [<c040e6df>] ? usb_new_device+0x12f/0x1e0
Apr  8 18:20:30 homeserver kernel: [<c040f4d8>] ? hub_thread+0x458/0x1230
Apr  8 18:20:30 homeserver kernel: [<c015554f>] ? dequeue_task_fair+0x9f/0xc0
Apr  8 18:20:30 homeserver kernel: [<c0131312>] ? release_task+0x1d2/0x330
Apr  8 18:20:30 homeserver kernel: [<c01477b0>] ? abort_exclusive_wait+0x90/0x90
Apr  8 18:20:30 homeserver kernel: [<c040f080>] ? usb_remote_wakeup+0x40/0x40
Apr  8 18:20:30 homeserver kernel: [<c0146ed2>] ? kthread+0x92/0xa0
Apr  8 18:20:30 homeserver kernel: [<c05c7877>] ? ret_from_kernel_thread+0x1b/0x28
Apr  8 18:20:30 homeserver kernel: [<c0146e40>] ? kthread_freezable_should_stop+0x50/0x50
Apr  8 18:20:30 homeserver kernel: Code: 89 04 24 89 f0 e8 05 ff ff ff 8b 5c 24 24 8b 74 24 28 8b 7c 24 2c 8b 6c 24 30 83 c4 34 c3 00 83 ec 08 89 1c 24 89 74 24 04 89 c3 <f0> ff 08 79 05 e8 ca 03 00 00 64 a1 70 d6 80 c0 8b 74 24 04 89
Apr  8 18:20:30 homeserver kernel: EIP: [<c05c4e4c>] mutex_lock+0xc/0x30 SS:ESP 0068:f6577c28
Apr  8 18:20:30 homeserver kernel: CR2: 0000000000000014
Apr  8 18:20:30 homeserver kernel: ---[ end trace df134132c967205c ]---

Signed-off-by: Kevin Baradon <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
@torvalds torvalds merged commit 131a80b into torvalds:master Jul 4, 2013
torvalds pushed a commit that referenced this pull request Jul 10, 2013
Several people reported the warning: "kernel BUG at kernel/timer.c:729!"
and the stack trace is:

	#7 [ffff880214d25c10] mod_timer+501 at ffffffff8106d905
	#8 [ffff880214d25c50] br_multicast_del_pg.isra.20+261 at ffffffffa0731d25 [bridge]
	#9 [ffff880214d25c80] br_multicast_disable_port+88 at ffffffffa0732948 [bridge]
	#10 [ffff880214d25cb0] br_stp_disable_port+154 at ffffffffa072bcca [bridge]
	#11 [ffff880214d25ce8] br_device_event+520 at ffffffffa072a4e8 [bridge]
	#12 [ffff880214d25d18] notifier_call_chain+76 at ffffffff8164aafc
	#13 [ffff880214d25d50] raw_notifier_call_chain+22 at ffffffff810858f6
	#14 [ffff880214d25d60] call_netdevice_notifiers+45 at ffffffff81536aad
	#15 [ffff880214d25d80] dev_close_many+183 at ffffffff81536d17
	#16 [ffff880214d25dc0] rollback_registered_many+168 at ffffffff81537f68
	#17 [ffff880214d25de8] rollback_registered+49 at ffffffff81538101
	#18 [ffff880214d25e10] unregister_netdevice_queue+72 at ffffffff815390d8
	#19 [ffff880214d25e30] __tun_detach+272 at ffffffffa074c2f0 [tun]
	#20 [ffff880214d25e88] tun_chr_close+45 at ffffffffa074c4bd [tun]
	#21 [ffff880214d25ea8] __fput+225 at ffffffff8119b1f1
	#22 [ffff880214d25ef0] ____fput+14 at ffffffff8119b3fe
	#23 [ffff880214d25f00] task_work_run+159 at ffffffff8107cf7f
	#24 [ffff880214d25f30] do_notify_resume+97 at ffffffff810139e1
	#25 [ffff880214d25f50] int_signal+18 at ffffffff8164f292

this is due to I forgot to check if mp->timer is armed in
br_multicast_del_pg(). This bug is introduced by
commit 9f00b2e (bridge: only expire the mdb entry
when query is received).

Same for __br_mdb_del().

Tested-by: poma <[email protected]>
Reported-by: LiYonghua <[email protected]>
Reported-by: Robert Hancock <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: "David S. Miller" <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
swarren pushed a commit to swarren/linux-tegra that referenced this pull request Sep 11, 2013
When booting secondary CPUs, announce_cpu() is called to show which cpu has
been brought up. For example:

[    0.402751] smpboot: Booting Node   0, Processors  #1 #2 #3 #4 #5 OK
[    0.525667] smpboot: Booting Node   1, Processors  torvalds#6 torvalds#7 torvalds#8 torvalds#9 torvalds#10 torvalds#11 OK
[    0.755592] smpboot: Booting Node   0, Processors  torvalds#12 torvalds#13 torvalds#14 torvalds#15 torvalds#16 torvalds#17 OK
[    0.890495] smpboot: Booting Node   1, Processors  torvalds#18 torvalds#19 torvalds#20 torvalds#21 torvalds#22 torvalds#23

But the last "OK" is lost, because 'nr_cpu_ids-1' represents the maximum
possible cpu id. It should use the maximum present cpu id in case not all
CPUs booted up.

Signed-off-by: Libin <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ tweaked the changelog, removed unnecessary line break, tweaked the format to align the fields vertically. ]
Signed-off-by: Ingo Molnar <[email protected]>
swarren pushed a commit to swarren/linux-tegra that referenced this pull request Oct 1, 2013
The driver uses platform_driver_probe() to obtain platform data
if any. However, that function is placed in the .init section so
it must be called upon driver module initialization.

The problem was reported by Fenguang Wu resulting in a kernel
oops because the .init section was already freed.

[   48.966342] Switched to clocksource tsc
[   48.970002] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[   48.970851] BUG: unable to handle kernel paging request at ffffffff82196446
[   48.970957] IP: [<ffffffff82196446>] classes_init+0x26/0x26
[   48.970957] PGD 1e76067 PUD 1e77063 PMD f388063 PTE 8000000002196163
[   48.970957] Oops: 0011 [#1]
[   48.970957] CPU: 0 PID: 17 Comm: kworker/0:1 Not tainted 3.11.0-rc7-00444-gc52dd7f torvalds#23
[   48.970957] Workqueue: events brcmf_driver_init
[   48.970957] task: ffff8800001d2000 ti: ffff8800001d4000 task.ti: ffff8800001d4000
[   48.970957] RIP: 0010:[<ffffffff82196446>]  [<ffffffff82196446>] classes_init+0x26/0x26
[   48.970957] RSP: 0000:ffff8800001d5d40  EFLAGS: 00000286
[   48.970957] RAX: 0000000000000001 RBX: ffffffff820c5620 RCX: 0000000000000000
[   48.970957] RDX: 0000000000000001 RSI: ffffffff816f7380 RDI: ffffffff820c56c0
[   48.970957] RBP: ffff8800001d5d50 R08: ffff8800001d2508 R09: 0000000000000002
[   48.970957] R10: 0000000000000000 R11: 0001f7ce298c5620 R12: ffff8800001c76b0
[   48.970957] R13: ffffffff81e91d40 R14: 0000000000000000 R15: ffff88000e0ce300
[   48.970957] FS:  0000000000000000(0000) GS:ffffffff81e84000(0000) knlGS:0000000000000000
[   48.970957] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   48.970957] CR2: ffffffff82196446 CR3: 0000000001e75000 CR4: 00000000000006b0
[   48.970957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   48.970957] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
[   48.970957] Stack:
[   48.970957]  ffffffff816f7df8 ffffffff820c5620 ffff8800001d5d60 ffffffff816eeec9
[   48.970957]  ffff8800001d5de0 ffffffff81073dc5 ffffffff81073d68 ffff8800001d5db8
[   48.970957]  0000000000000086 ffffffff820c5620 ffffffff824f7fd0 0000000000000000
[   48.970957] Call Trace:
[   48.970957]  [<ffffffff816f7df8>] ? brcmf_sdio_init+0x18/0x70
[   48.970957]  [<ffffffff816eeec9>] brcmf_driver_init+0x9/0x10
[   48.970957]  [<ffffffff81073dc5>] process_one_work+0x1d5/0x480
[   48.970957]  [<ffffffff81073d68>] ? process_one_work+0x178/0x480
[   48.970957]  [<ffffffff81074188>] worker_thread+0x118/0x3a0
[   48.970957]  [<ffffffff81074070>] ? process_one_work+0x480/0x480
[   48.970957]  [<ffffffff8107aa17>] kthread+0xe7/0xf0
[   48.970957]  [<ffffffff810829f7>] ? finish_task_switch.constprop.57+0x37/0xd0
[   48.970957]  [<ffffffff8107a930>] ? __kthread_parkme+0x80/0x80
[   48.970957]  [<ffffffff81a6923a>] ret_from_fork+0x7a/0xb0
[   48.970957]  [<ffffffff8107a930>] ? __kthread_parkme+0x80/0x80
[   48.970957] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc <cc> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[   48.970957] RIP  [<ffffffff82196446>] classes_init+0x26/0x26
[   48.970957]  RSP <ffff8800001d5d40>
[   48.970957] CR2: ffffffff82196446
[   48.970957] ---[ end trace 62980817cd525f14 ]---

Cc: <[email protected]> # 3.10.x, 3.11.x
Reported-by: Fengguang Wu <[email protected]>
Reviewed-by: Hante Meuleman <[email protected]>
Reviewed-by: Pieter-Paul Giesberts <[email protected]>
Tested-by: Fengguang Wu <[email protected]>
Signed-off-by: Arend van Spriel <[email protected]>
Signed-off-by: John W. Linville <[email protected]>
heftig referenced this pull request in zen-kernel/zen-kernel Oct 14, 2013
commit db4efbb upstream.

The driver uses platform_driver_probe() to obtain platform data
if any. However, that function is placed in the .init section so
it must be called upon driver module initialization.

The problem was reported by Fenguang Wu resulting in a kernel
oops because the .init section was already freed.

[   48.966342] Switched to clocksource tsc
[   48.970002] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[   48.970851] BUG: unable to handle kernel paging request at ffffffff82196446
[   48.970957] IP: [<ffffffff82196446>] classes_init+0x26/0x26
[   48.970957] PGD 1e76067 PUD 1e77063 PMD f388063 PTE 8000000002196163
[   48.970957] Oops: 0011 [#1]
[   48.970957] CPU: 0 PID: 17 Comm: kworker/0:1 Not tainted 3.11.0-rc7-00444-gc52dd7f #23
[   48.970957] Workqueue: events brcmf_driver_init
[   48.970957] task: ffff8800001d2000 ti: ffff8800001d4000 task.ti: ffff8800001d4000
[   48.970957] RIP: 0010:[<ffffffff82196446>]  [<ffffffff82196446>] classes_init+0x26/0x26
[   48.970957] RSP: 0000:ffff8800001d5d40  EFLAGS: 00000286
[   48.970957] RAX: 0000000000000001 RBX: ffffffff820c5620 RCX: 0000000000000000
[   48.970957] RDX: 0000000000000001 RSI: ffffffff816f7380 RDI: ffffffff820c56c0
[   48.970957] RBP: ffff8800001d5d50 R08: ffff8800001d2508 R09: 0000000000000002
[   48.970957] R10: 0000000000000000 R11: 0001f7ce298c5620 R12: ffff8800001c76b0
[   48.970957] R13: ffffffff81e91d40 R14: 0000000000000000 R15: ffff88000e0ce300
[   48.970957] FS:  0000000000000000(0000) GS:ffffffff81e84000(0000) knlGS:0000000000000000
[   48.970957] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   48.970957] CR2: ffffffff82196446 CR3: 0000000001e75000 CR4: 00000000000006b0
[   48.970957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   48.970957] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
[   48.970957] Stack:
[   48.970957]  ffffffff816f7df8 ffffffff820c5620 ffff8800001d5d60 ffffffff816eeec9
[   48.970957]  ffff8800001d5de0 ffffffff81073dc5 ffffffff81073d68 ffff8800001d5db8
[   48.970957]  0000000000000086 ffffffff820c5620 ffffffff824f7fd0 0000000000000000
[   48.970957] Call Trace:
[   48.970957]  [<ffffffff816f7df8>] ? brcmf_sdio_init+0x18/0x70
[   48.970957]  [<ffffffff816eeec9>] brcmf_driver_init+0x9/0x10
[   48.970957]  [<ffffffff81073dc5>] process_one_work+0x1d5/0x480
[   48.970957]  [<ffffffff81073d68>] ? process_one_work+0x178/0x480
[   48.970957]  [<ffffffff81074188>] worker_thread+0x118/0x3a0
[   48.970957]  [<ffffffff81074070>] ? process_one_work+0x480/0x480
[   48.970957]  [<ffffffff8107aa17>] kthread+0xe7/0xf0
[   48.970957]  [<ffffffff810829f7>] ? finish_task_switch.constprop.57+0x37/0xd0
[   48.970957]  [<ffffffff8107a930>] ? __kthread_parkme+0x80/0x80
[   48.970957]  [<ffffffff81a6923a>] ret_from_fork+0x7a/0xb0
[   48.970957]  [<ffffffff8107a930>] ? __kthread_parkme+0x80/0x80
[   48.970957] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc <cc> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[   48.970957] RIP  [<ffffffff82196446>] classes_init+0x26/0x26
[   48.970957]  RSP <ffff8800001d5d40>
[   48.970957] CR2: ffffffff82196446
[   48.970957] ---[ end trace 62980817cd525f14 ]---

Reported-by: Fengguang Wu <[email protected]>
Reviewed-by: Hante Meuleman <[email protected]>
Reviewed-by: Pieter-Paul Giesberts <[email protected]>
Tested-by: Fengguang Wu <[email protected]>
Signed-off-by: Arend van Spriel <[email protected]>
Signed-off-by: John W. Linville <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
swarren pushed a commit to swarren/linux-tegra that referenced this pull request Oct 14, 2013
As the new x86 CPU bootup printout format code maintainer, I am
taking immediate action to improve and clean (and thus indulge
my OCD) the reporting of the cores when coming up online.

Fix padding to a right-hand alignment, cleanup code and bind
reporting width to the max number of supported CPUs on the
system, like this:

 [    0.074509] smpboot: Booting Node   0, Processors:      #1  #2  #3  #4  #5  torvalds#6  torvalds#7 OK
 [    0.644008] smpboot: Booting Node   1, Processors:  torvalds#8  torvalds#9 torvalds#10 torvalds#11 torvalds#12 torvalds#13 torvalds#14 torvalds#15 OK
 [    1.245006] smpboot: Booting Node   2, Processors: torvalds#16 torvalds#17 torvalds#18 torvalds#19 torvalds#20 torvalds#21 torvalds#22 torvalds#23 OK
 [    1.864005] smpboot: Booting Node   3, Processors: torvalds#24 torvalds#25 torvalds#26 torvalds#27 torvalds#28 torvalds#29 torvalds#30 torvalds#31 OK
 [    2.489005] smpboot: Booting Node   4, Processors: torvalds#32 torvalds#33 torvalds#34 torvalds#35 torvalds#36 torvalds#37 torvalds#38 torvalds#39 OK
 [    3.093005] smpboot: Booting Node   5, Processors: torvalds#40 torvalds#41 torvalds#42 torvalds#43 torvalds#44 torvalds#45 torvalds#46 torvalds#47 OK
 [    3.698005] smpboot: Booting Node   6, Processors: torvalds#48 torvalds#49 torvalds#50 torvalds#51 #52 #53 torvalds#54 torvalds#55 OK
 [    4.304005] smpboot: Booting Node   7, Processors: torvalds#56 torvalds#57 #58 torvalds#59 torvalds#60 torvalds#61 torvalds#62 torvalds#63 OK
 [    4.961413] Brought up 64 CPUs

and this:

 [    0.072367] smpboot: Booting Node   0, Processors:    #1 #2 #3 #4 #5 torvalds#6 torvalds#7 OK
 [    0.686329] Brought up 8 CPUs

Signed-off-by: Borislav Petkov <[email protected]>
Cc: Libin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
swarren pushed a commit to swarren/linux-tegra that referenced this pull request Oct 14, 2013
Turn it into (for example):

[    0.073380] x86: Booting SMP configuration:
[    0.074005] .... node   #0, CPUs:          #1   #2   #3   #4   #5   torvalds#6   torvalds#7
[    0.603005] .... node   #1, CPUs:     torvalds#8   torvalds#9  torvalds#10  torvalds#11  torvalds#12  torvalds#13  torvalds#14  torvalds#15
[    1.200005] .... node   #2, CPUs:    torvalds#16  torvalds#17  torvalds#18  torvalds#19  torvalds#20  torvalds#21  torvalds#22  torvalds#23
[    1.796005] .... node   #3, CPUs:    torvalds#24  torvalds#25  torvalds#26  torvalds#27  torvalds#28  torvalds#29  torvalds#30  torvalds#31
[    2.393005] .... node   #4, CPUs:    torvalds#32  torvalds#33  torvalds#34  torvalds#35  torvalds#36  torvalds#37  torvalds#38  torvalds#39
[    2.996005] .... node   #5, CPUs:    torvalds#40  torvalds#41  torvalds#42  torvalds#43  torvalds#44  torvalds#45  torvalds#46  torvalds#47
[    3.600005] .... node   torvalds#6, CPUs:    torvalds#48  torvalds#49  torvalds#50  torvalds#51  #52  #53  torvalds#54  torvalds#55
[    4.202005] .... node   torvalds#7, CPUs:    torvalds#56  torvalds#57  #58  torvalds#59  torvalds#60  torvalds#61  torvalds#62  torvalds#63
[    4.811005] .... node   torvalds#8, CPUs:    torvalds#64  torvalds#65  torvalds#66  torvalds#67  torvalds#68  torvalds#69  #70  torvalds#71
[    5.421006] .... node   torvalds#9, CPUs:    torvalds#72  torvalds#73  torvalds#74  torvalds#75  torvalds#76  torvalds#77  torvalds#78  torvalds#79
[    6.032005] .... node  torvalds#10, CPUs:    torvalds#80  torvalds#81  torvalds#82  torvalds#83  torvalds#84  torvalds#85  torvalds#86  torvalds#87
[    6.648006] .... node  torvalds#11, CPUs:    torvalds#88  torvalds#89  torvalds#90  torvalds#91  torvalds#92  torvalds#93  torvalds#94  torvalds#95
[    7.262005] .... node  torvalds#12, CPUs:    torvalds#96  torvalds#97  torvalds#98  torvalds#99 torvalds#100 torvalds#101 torvalds#102 torvalds#103
[    7.865005] .... node  torvalds#13, CPUs:   torvalds#104 torvalds#105 torvalds#106 torvalds#107 torvalds#108 torvalds#109 torvalds#110 torvalds#111
[    8.466005] .... node  torvalds#14, CPUs:   torvalds#112 torvalds#113 torvalds#114 torvalds#115 torvalds#116 torvalds#117 torvalds#118 torvalds#119
[    9.073006] .... node  torvalds#15, CPUs:   torvalds#120 torvalds#121 torvalds#122 torvalds#123 torvalds#124 torvalds#125 torvalds#126 torvalds#127
[    9.679901] x86: Booted up 16 nodes, 128 CPUs

and drop useless elements.

Change num_digits() to hpa's division-avoiding, cell-phone-typed
version which he went at great lengths and pains to submit on a
Saturday evening.

Signed-off-by: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
1054009064 pushed a commit to 1054009064/linux that referenced this pull request Jul 24, 2025
[ Upstream commit b640daa ]

Running lwt_dst_cache_ref_loop.sh in selftest with KASAN triggers
the splat below [0].

rpl_do_srh_inline() fetches ipv6_hdr(skb) and accesses it after
skb_cow_head(), which is illegal as the header could be freed then.

Let's fix it by making oldhdr to a local struct instead of a pointer.

[0]:
[root@fedora net]# ./lwt_dst_cache_ref_loop.sh
...
TEST: rpl (input)
[   57.631529] ==================================================================
BUG: KASAN: slab-use-after-free in rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174)
Read of size 40 at addr ffff888122bf96d8 by task ping6/1543

CPU: 50 UID: 0 PID: 1543 Comm: ping6 Not tainted 6.16.0-rc5-01302-gfadd1e6231b1 torvalds#23 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl (lib/dump_stack.c:122)
 print_report (mm/kasan/report.c:409 mm/kasan/report.c:521)
 kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:636)
 kasan_check_range (mm/kasan/generic.c:175 (discriminator 1) mm/kasan/generic.c:189 (discriminator 1))
 __asan_memmove (mm/kasan/shadow.c:94 (discriminator 2))
 rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174)
 rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282)
 lwtunnel_input (net/core/lwtunnel.c:459)
 ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1))
 __netif_receive_skb_one_core (net/core/dev.c:5967)
 process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440)
 __napi_poll.constprop.0 (net/core/dev.c:7452)
 net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643)
 handle_softirqs (kernel/softirq.c:579)
 do_softirq (kernel/softirq.c:480 (discriminator 20))
 </IRQ>
 <TASK>
 __local_bh_enable_ip (kernel/softirq.c:407)
 __dev_queue_xmit (net/core/dev.c:4740)
 ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141)
 ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226)
 ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248)
 ip6_send_skb (net/ipv6/ip6_output.c:1983)
 rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f68cffb2a06
Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
RSP: 002b:00007ffefb7c53d0 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000564cd69f10a0 RCX: 00007f68cffb2a06
RDX: 0000000000000040 RSI: 0000564cd69f10a4 RDI: 0000000000000003
RBP: 00007ffefb7c53f0 R08: 0000564cd6a032ac R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000202 R12: 0000564cd69f10a4
R13: 0000000000000040 R14: 00007ffefb7c66e0 R15: 0000564cd69f10a0
 </TASK>

Allocated by task 1543:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 __kasan_slab_alloc (mm/kasan/common.c:319 mm/kasan/common.c:345)
 kmem_cache_alloc_node_noprof (./include/linux/kasan.h:250 mm/slub.c:4148 mm/slub.c:4197 mm/slub.c:4249)
 kmalloc_reserve (net/core/skbuff.c:581 (discriminator 88))
 __alloc_skb (net/core/skbuff.c:669)
 __ip6_append_data (net/ipv6/ip6_output.c:1672 (discriminator 1))
 ip6_append_data (net/ipv6/ip6_output.c:1859)
 rawv6_sendmsg (net/ipv6/raw.c:911)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Freed by task 1543:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 kasan_save_free_info (mm/kasan/generic.c:579 (discriminator 1))
 __kasan_slab_free (mm/kasan/common.c:271)
 kmem_cache_free (mm/slub.c:4643 (discriminator 3) mm/slub.c:4745 (discriminator 3))
 pskb_expand_head (net/core/skbuff.c:2274)
 rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:158 (discriminator 1))
 rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282)
 lwtunnel_input (net/core/lwtunnel.c:459)
 ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1))
 __netif_receive_skb_one_core (net/core/dev.c:5967)
 process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440)
 __napi_poll.constprop.0 (net/core/dev.c:7452)
 net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643)
 handle_softirqs (kernel/softirq.c:579)
 do_softirq (kernel/softirq.c:480 (discriminator 20))
 __local_bh_enable_ip (kernel/softirq.c:407)
 __dev_queue_xmit (net/core/dev.c:4740)
 ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141)
 ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226)
 ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248)
 ip6_send_skb (net/ipv6/ip6_output.c:1983)
 rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

The buggy address belongs to the object at ffff888122bf96c0
 which belongs to the cache skbuff_small_head of size 704
The buggy address is located 24 bytes inside of
 freed 704-byte region [ffff888122bf96c0, ffff888122bf9980)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x122bf8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x200000000000040(head|node=0|zone=2)
page_type: f5(slab)
raw: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002
raw: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000
head: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002
head: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000
head: 0200000000000003 ffffea00048afe01 00000000ffffffff 00000000ffffffff
head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888122bf9580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888122bf9600: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888122bf9680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
                                                    ^
 ffff888122bf9700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888122bf9780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: a7a29f9 ("net: ipv6: add rpl sr tunnel")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
1054009064 pushed a commit to 1054009064/linux that referenced this pull request Jul 24, 2025
commit d992ed7 upstream.

When device reset is triggered by feature changes such as toggling Rx
VLAN offload, wx->do_reset() is called to reinitialize Rx rings. The
hardware descriptor ring may retain stale values from previous sessions.
And only set the length to 0 in rx_desc[0] would result in building
malformed SKBs. Fix it to ensure a clean slate after device reset.

[  549.186435] [     C16] ------------[ cut here ]------------
[  549.186457] [     C16] kernel BUG at net/core/skbuff.c:2814!
[  549.186468] [     C16] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[  549.186472] [     C16] CPU: 16 UID: 0 PID: 0 Comm: swapper/16 Kdump: loaded Not tainted 6.16.0-rc4+ torvalds#23 PREEMPT(voluntary)
[  549.186476] [     C16] Hardware name: Micro-Star International Co., Ltd. MS-7E16/X670E GAMING PLUS WIFI (MS-7E16), BIOS 1.90 12/31/2024
[  549.186478] [     C16] RIP: 0010:__pskb_pull_tail+0x3ff/0x510
[  549.186484] [     C16] Code: 06 f0 ff 4f 34 74 7b 4d 8b 8c 24 c8 00 00 00 45 8b 84 24 c0 00 00 00 e9 c8 fd ff ff 48 c7 44 24 08 00 00 00 00 e9 5e fe ff ff <0f> 0b 31 c0 e9 23 90 5b ff 41 f7 c6 ff 0f 00 00 75 bf 49 8b 06 a8
[  549.186487] [     C16] RSP: 0018:ffffb391c0640d70 EFLAGS: 00010282
[  549.186490] [     C16] RAX: 00000000fffffff2 RBX: ffff8fe7e4d40200 RCX: 00000000fffffff2
[  549.186492] [     C16] RDX: ffff8fe7c3a4bf8e RSI: 0000000000000180 RDI: ffff8fe7c3a4bf40
[  549.186494] [     C16] RBP: ffffb391c0640da8 R08: ffff8fe7c3a4c0c0 R09: 000000000000000e
[  549.186496] [     C16] R10: ffffb391c0640d88 R11: 000000000000000e R12: ffff8fe7e4d40200
[  549.186497] [     C16] R13: 00000000fffffff2 R14: ffff8fe7fa01a000 R15: 00000000fffffff2
[  549.186499] [     C16] FS:  0000000000000000(0000) GS:ffff8fef5ae40000(0000) knlGS:0000000000000000
[  549.186502] [     C16] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  549.186503] [     C16] CR2: 00007f77d81d6000 CR3: 000000051a032000 CR4: 0000000000750ef0
[  549.186505] [     C16] PKRU: 55555554
[  549.186507] [     C16] Call Trace:
[  549.186510] [     C16]  <IRQ>
[  549.186513] [     C16]  ? srso_alias_return_thunk+0x5/0xfbef5
[  549.186517] [     C16]  __skb_pad+0xc7/0xf0
[  549.186523] [     C16]  wx_clean_rx_irq+0x355/0x3b0 [libwx]
[  549.186533] [     C16]  wx_poll+0x92/0x120 [libwx]
[  549.186540] [     C16]  __napi_poll+0x28/0x190
[  549.186544] [     C16]  net_rx_action+0x301/0x3f0
[  549.186548] [     C16]  ? srso_alias_return_thunk+0x5/0xfbef5
[  549.186551] [     C16]  ? __raw_spin_lock_irqsave+0x1e/0x50
[  549.186554] [     C16]  ? srso_alias_return_thunk+0x5/0xfbef5
[  549.186557] [     C16]  ? wake_up_nohz_cpu+0x35/0x160
[  549.186559] [     C16]  ? srso_alias_return_thunk+0x5/0xfbef5
[  549.186563] [     C16]  handle_softirqs+0xf9/0x2c0
[  549.186568] [     C16]  __irq_exit_rcu+0xc7/0x130
[  549.186572] [     C16]  common_interrupt+0xb8/0xd0
[  549.186576] [     C16]  </IRQ>
[  549.186577] [     C16]  <TASK>
[  549.186579] [     C16]  asm_common_interrupt+0x22/0x40
[  549.186582] [     C16] RIP: 0010:cpuidle_enter_state+0xc2/0x420
[  549.186585] [     C16] Code: 00 00 e8 11 0e 5e ff e8 ac f0 ff ff 49 89 c5 0f 1f 44 00 00 31 ff e8 0d ed 5c ff 45 84 ff 0f 85 40 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 84 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
[  549.186587] [     C16] RSP: 0018:ffffb391c0277e78 EFLAGS: 00000246
[  549.186590] [     C16] RAX: ffff8fef5ae40000 RBX: 0000000000000003 RCX: 0000000000000000
[  549.186591] [     C16] RDX: 0000007fde0faac5 RSI: ffffffff826e53f6 RDI: ffffffff826fa9b3
[  549.186593] [     C16] RBP: ffff8fe7c3a20800 R08: 0000000000000002 R09: 0000000000000000
[  549.186595] [     C16] R10: 0000000000000000 R11: 000000000000ffff R12: ffffffff82ed7a40
[  549.186596] [     C16] R13: 0000007fde0faac5 R14: 0000000000000003 R15: 0000000000000000
[  549.186601] [     C16]  ? cpuidle_enter_state+0xb3/0x420
[  549.186605] [     C16]  cpuidle_enter+0x29/0x40
[  549.186609] [     C16]  cpuidle_idle_call+0xfd/0x170
[  549.186613] [     C16]  do_idle+0x7a/0xc0
[  549.186616] [     C16]  cpu_startup_entry+0x25/0x30
[  549.186618] [     C16]  start_secondary+0x117/0x140
[  549.186623] [     C16]  common_startup_64+0x13e/0x148
[  549.186628] [     C16]  </TASK>

Fixes: 3c47e8a ("net: libwx: Support to receive packets in NAPI")
Cc: [email protected]
Signed-off-by: Jiawen Wu <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
1054009064 pushed a commit to 1054009064/linux that referenced this pull request Jul 24, 2025
[ Upstream commit b640daa ]

Running lwt_dst_cache_ref_loop.sh in selftest with KASAN triggers
the splat below [0].

rpl_do_srh_inline() fetches ipv6_hdr(skb) and accesses it after
skb_cow_head(), which is illegal as the header could be freed then.

Let's fix it by making oldhdr to a local struct instead of a pointer.

[0]:
[root@fedora net]# ./lwt_dst_cache_ref_loop.sh
...
TEST: rpl (input)
[   57.631529] ==================================================================
BUG: KASAN: slab-use-after-free in rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174)
Read of size 40 at addr ffff888122bf96d8 by task ping6/1543

CPU: 50 UID: 0 PID: 1543 Comm: ping6 Not tainted 6.16.0-rc5-01302-gfadd1e6231b1 torvalds#23 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl (lib/dump_stack.c:122)
 print_report (mm/kasan/report.c:409 mm/kasan/report.c:521)
 kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:636)
 kasan_check_range (mm/kasan/generic.c:175 (discriminator 1) mm/kasan/generic.c:189 (discriminator 1))
 __asan_memmove (mm/kasan/shadow.c:94 (discriminator 2))
 rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174)
 rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282)
 lwtunnel_input (net/core/lwtunnel.c:459)
 ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1))
 __netif_receive_skb_one_core (net/core/dev.c:5967)
 process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440)
 __napi_poll.constprop.0 (net/core/dev.c:7452)
 net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643)
 handle_softirqs (kernel/softirq.c:579)
 do_softirq (kernel/softirq.c:480 (discriminator 20))
 </IRQ>
 <TASK>
 __local_bh_enable_ip (kernel/softirq.c:407)
 __dev_queue_xmit (net/core/dev.c:4740)
 ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141)
 ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226)
 ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248)
 ip6_send_skb (net/ipv6/ip6_output.c:1983)
 rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f68cffb2a06
Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
RSP: 002b:00007ffefb7c53d0 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000564cd69f10a0 RCX: 00007f68cffb2a06
RDX: 0000000000000040 RSI: 0000564cd69f10a4 RDI: 0000000000000003
RBP: 00007ffefb7c53f0 R08: 0000564cd6a032ac R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000202 R12: 0000564cd69f10a4
R13: 0000000000000040 R14: 00007ffefb7c66e0 R15: 0000564cd69f10a0
 </TASK>

Allocated by task 1543:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 __kasan_slab_alloc (mm/kasan/common.c:319 mm/kasan/common.c:345)
 kmem_cache_alloc_node_noprof (./include/linux/kasan.h:250 mm/slub.c:4148 mm/slub.c:4197 mm/slub.c:4249)
 kmalloc_reserve (net/core/skbuff.c:581 (discriminator 88))
 __alloc_skb (net/core/skbuff.c:669)
 __ip6_append_data (net/ipv6/ip6_output.c:1672 (discriminator 1))
 ip6_append_data (net/ipv6/ip6_output.c:1859)
 rawv6_sendmsg (net/ipv6/raw.c:911)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Freed by task 1543:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 kasan_save_free_info (mm/kasan/generic.c:579 (discriminator 1))
 __kasan_slab_free (mm/kasan/common.c:271)
 kmem_cache_free (mm/slub.c:4643 (discriminator 3) mm/slub.c:4745 (discriminator 3))
 pskb_expand_head (net/core/skbuff.c:2274)
 rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:158 (discriminator 1))
 rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282)
 lwtunnel_input (net/core/lwtunnel.c:459)
 ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1))
 __netif_receive_skb_one_core (net/core/dev.c:5967)
 process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440)
 __napi_poll.constprop.0 (net/core/dev.c:7452)
 net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643)
 handle_softirqs (kernel/softirq.c:579)
 do_softirq (kernel/softirq.c:480 (discriminator 20))
 __local_bh_enable_ip (kernel/softirq.c:407)
 __dev_queue_xmit (net/core/dev.c:4740)
 ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141)
 ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226)
 ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248)
 ip6_send_skb (net/ipv6/ip6_output.c:1983)
 rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

The buggy address belongs to the object at ffff888122bf96c0
 which belongs to the cache skbuff_small_head of size 704
The buggy address is located 24 bytes inside of
 freed 704-byte region [ffff888122bf96c0, ffff888122bf9980)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x122bf8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x200000000000040(head|node=0|zone=2)
page_type: f5(slab)
raw: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002
raw: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000
head: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002
head: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000
head: 0200000000000003 ffffea00048afe01 00000000ffffffff 00000000ffffffff
head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888122bf9580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888122bf9600: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888122bf9680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
                                                    ^
 ffff888122bf9700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888122bf9780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: a7a29f9 ("net: ipv6: add rpl sr tunnel")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
1054009064 pushed a commit to 1054009064/linux that referenced this pull request Jul 24, 2025
[ Upstream commit b640daa ]

Running lwt_dst_cache_ref_loop.sh in selftest with KASAN triggers
the splat below [0].

rpl_do_srh_inline() fetches ipv6_hdr(skb) and accesses it after
skb_cow_head(), which is illegal as the header could be freed then.

Let's fix it by making oldhdr to a local struct instead of a pointer.

[0]:
[root@fedora net]# ./lwt_dst_cache_ref_loop.sh
...
TEST: rpl (input)
[   57.631529] ==================================================================
BUG: KASAN: slab-use-after-free in rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174)
Read of size 40 at addr ffff888122bf96d8 by task ping6/1543

CPU: 50 UID: 0 PID: 1543 Comm: ping6 Not tainted 6.16.0-rc5-01302-gfadd1e6231b1 torvalds#23 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl (lib/dump_stack.c:122)
 print_report (mm/kasan/report.c:409 mm/kasan/report.c:521)
 kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:636)
 kasan_check_range (mm/kasan/generic.c:175 (discriminator 1) mm/kasan/generic.c:189 (discriminator 1))
 __asan_memmove (mm/kasan/shadow.c:94 (discriminator 2))
 rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174)
 rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282)
 lwtunnel_input (net/core/lwtunnel.c:459)
 ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1))
 __netif_receive_skb_one_core (net/core/dev.c:5967)
 process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440)
 __napi_poll.constprop.0 (net/core/dev.c:7452)
 net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643)
 handle_softirqs (kernel/softirq.c:579)
 do_softirq (kernel/softirq.c:480 (discriminator 20))
 </IRQ>
 <TASK>
 __local_bh_enable_ip (kernel/softirq.c:407)
 __dev_queue_xmit (net/core/dev.c:4740)
 ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141)
 ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226)
 ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248)
 ip6_send_skb (net/ipv6/ip6_output.c:1983)
 rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f68cffb2a06
Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
RSP: 002b:00007ffefb7c53d0 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000564cd69f10a0 RCX: 00007f68cffb2a06
RDX: 0000000000000040 RSI: 0000564cd69f10a4 RDI: 0000000000000003
RBP: 00007ffefb7c53f0 R08: 0000564cd6a032ac R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000202 R12: 0000564cd69f10a4
R13: 0000000000000040 R14: 00007ffefb7c66e0 R15: 0000564cd69f10a0
 </TASK>

Allocated by task 1543:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 __kasan_slab_alloc (mm/kasan/common.c:319 mm/kasan/common.c:345)
 kmem_cache_alloc_node_noprof (./include/linux/kasan.h:250 mm/slub.c:4148 mm/slub.c:4197 mm/slub.c:4249)
 kmalloc_reserve (net/core/skbuff.c:581 (discriminator 88))
 __alloc_skb (net/core/skbuff.c:669)
 __ip6_append_data (net/ipv6/ip6_output.c:1672 (discriminator 1))
 ip6_append_data (net/ipv6/ip6_output.c:1859)
 rawv6_sendmsg (net/ipv6/raw.c:911)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Freed by task 1543:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 kasan_save_free_info (mm/kasan/generic.c:579 (discriminator 1))
 __kasan_slab_free (mm/kasan/common.c:271)
 kmem_cache_free (mm/slub.c:4643 (discriminator 3) mm/slub.c:4745 (discriminator 3))
 pskb_expand_head (net/core/skbuff.c:2274)
 rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:158 (discriminator 1))
 rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282)
 lwtunnel_input (net/core/lwtunnel.c:459)
 ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1))
 __netif_receive_skb_one_core (net/core/dev.c:5967)
 process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440)
 __napi_poll.constprop.0 (net/core/dev.c:7452)
 net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643)
 handle_softirqs (kernel/softirq.c:579)
 do_softirq (kernel/softirq.c:480 (discriminator 20))
 __local_bh_enable_ip (kernel/softirq.c:407)
 __dev_queue_xmit (net/core/dev.c:4740)
 ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141)
 ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226)
 ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248)
 ip6_send_skb (net/ipv6/ip6_output.c:1983)
 rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

The buggy address belongs to the object at ffff888122bf96c0
 which belongs to the cache skbuff_small_head of size 704
The buggy address is located 24 bytes inside of
 freed 704-byte region [ffff888122bf96c0, ffff888122bf9980)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x122bf8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x200000000000040(head|node=0|zone=2)
page_type: f5(slab)
raw: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002
raw: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000
head: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002
head: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000
head: 0200000000000003 ffffea00048afe01 00000000ffffffff 00000000ffffffff
head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888122bf9580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888122bf9600: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888122bf9680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
                                                    ^
 ffff888122bf9700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888122bf9780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: a7a29f9 ("net: ipv6: add rpl sr tunnel")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
1054009064 pushed a commit to 1054009064/linux that referenced this pull request Jul 24, 2025
commit d992ed7 upstream.

When device reset is triggered by feature changes such as toggling Rx
VLAN offload, wx->do_reset() is called to reinitialize Rx rings. The
hardware descriptor ring may retain stale values from previous sessions.
And only set the length to 0 in rx_desc[0] would result in building
malformed SKBs. Fix it to ensure a clean slate after device reset.

[  549.186435] [     C16] ------------[ cut here ]------------
[  549.186457] [     C16] kernel BUG at net/core/skbuff.c:2814!
[  549.186468] [     C16] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[  549.186472] [     C16] CPU: 16 UID: 0 PID: 0 Comm: swapper/16 Kdump: loaded Not tainted 6.16.0-rc4+ torvalds#23 PREEMPT(voluntary)
[  549.186476] [     C16] Hardware name: Micro-Star International Co., Ltd. MS-7E16/X670E GAMING PLUS WIFI (MS-7E16), BIOS 1.90 12/31/2024
[  549.186478] [     C16] RIP: 0010:__pskb_pull_tail+0x3ff/0x510
[  549.186484] [     C16] Code: 06 f0 ff 4f 34 74 7b 4d 8b 8c 24 c8 00 00 00 45 8b 84 24 c0 00 00 00 e9 c8 fd ff ff 48 c7 44 24 08 00 00 00 00 e9 5e fe ff ff <0f> 0b 31 c0 e9 23 90 5b ff 41 f7 c6 ff 0f 00 00 75 bf 49 8b 06 a8
[  549.186487] [     C16] RSP: 0018:ffffb391c0640d70 EFLAGS: 00010282
[  549.186490] [     C16] RAX: 00000000fffffff2 RBX: ffff8fe7e4d40200 RCX: 00000000fffffff2
[  549.186492] [     C16] RDX: ffff8fe7c3a4bf8e RSI: 0000000000000180 RDI: ffff8fe7c3a4bf40
[  549.186494] [     C16] RBP: ffffb391c0640da8 R08: ffff8fe7c3a4c0c0 R09: 000000000000000e
[  549.186496] [     C16] R10: ffffb391c0640d88 R11: 000000000000000e R12: ffff8fe7e4d40200
[  549.186497] [     C16] R13: 00000000fffffff2 R14: ffff8fe7fa01a000 R15: 00000000fffffff2
[  549.186499] [     C16] FS:  0000000000000000(0000) GS:ffff8fef5ae40000(0000) knlGS:0000000000000000
[  549.186502] [     C16] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  549.186503] [     C16] CR2: 00007f77d81d6000 CR3: 000000051a032000 CR4: 0000000000750ef0
[  549.186505] [     C16] PKRU: 55555554
[  549.186507] [     C16] Call Trace:
[  549.186510] [     C16]  <IRQ>
[  549.186513] [     C16]  ? srso_alias_return_thunk+0x5/0xfbef5
[  549.186517] [     C16]  __skb_pad+0xc7/0xf0
[  549.186523] [     C16]  wx_clean_rx_irq+0x355/0x3b0 [libwx]
[  549.186533] [     C16]  wx_poll+0x92/0x120 [libwx]
[  549.186540] [     C16]  __napi_poll+0x28/0x190
[  549.186544] [     C16]  net_rx_action+0x301/0x3f0
[  549.186548] [     C16]  ? srso_alias_return_thunk+0x5/0xfbef5
[  549.186551] [     C16]  ? __raw_spin_lock_irqsave+0x1e/0x50
[  549.186554] [     C16]  ? srso_alias_return_thunk+0x5/0xfbef5
[  549.186557] [     C16]  ? wake_up_nohz_cpu+0x35/0x160
[  549.186559] [     C16]  ? srso_alias_return_thunk+0x5/0xfbef5
[  549.186563] [     C16]  handle_softirqs+0xf9/0x2c0
[  549.186568] [     C16]  __irq_exit_rcu+0xc7/0x130
[  549.186572] [     C16]  common_interrupt+0xb8/0xd0
[  549.186576] [     C16]  </IRQ>
[  549.186577] [     C16]  <TASK>
[  549.186579] [     C16]  asm_common_interrupt+0x22/0x40
[  549.186582] [     C16] RIP: 0010:cpuidle_enter_state+0xc2/0x420
[  549.186585] [     C16] Code: 00 00 e8 11 0e 5e ff e8 ac f0 ff ff 49 89 c5 0f 1f 44 00 00 31 ff e8 0d ed 5c ff 45 84 ff 0f 85 40 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 84 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
[  549.186587] [     C16] RSP: 0018:ffffb391c0277e78 EFLAGS: 00000246
[  549.186590] [     C16] RAX: ffff8fef5ae40000 RBX: 0000000000000003 RCX: 0000000000000000
[  549.186591] [     C16] RDX: 0000007fde0faac5 RSI: ffffffff826e53f6 RDI: ffffffff826fa9b3
[  549.186593] [     C16] RBP: ffff8fe7c3a20800 R08: 0000000000000002 R09: 0000000000000000
[  549.186595] [     C16] R10: 0000000000000000 R11: 000000000000ffff R12: ffffffff82ed7a40
[  549.186596] [     C16] R13: 0000007fde0faac5 R14: 0000000000000003 R15: 0000000000000000
[  549.186601] [     C16]  ? cpuidle_enter_state+0xb3/0x420
[  549.186605] [     C16]  cpuidle_enter+0x29/0x40
[  549.186609] [     C16]  cpuidle_idle_call+0xfd/0x170
[  549.186613] [     C16]  do_idle+0x7a/0xc0
[  549.186616] [     C16]  cpu_startup_entry+0x25/0x30
[  549.186618] [     C16]  start_secondary+0x117/0x140
[  549.186623] [     C16]  common_startup_64+0x13e/0x148
[  549.186628] [     C16]  </TASK>

Fixes: 3c47e8a ("net: libwx: Support to receive packets in NAPI")
Cc: [email protected]
Signed-off-by: Jiawen Wu <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
1054009064 pushed a commit to 1054009064/linux that referenced this pull request Jul 24, 2025
[ Upstream commit b640daa ]

Running lwt_dst_cache_ref_loop.sh in selftest with KASAN triggers
the splat below [0].

rpl_do_srh_inline() fetches ipv6_hdr(skb) and accesses it after
skb_cow_head(), which is illegal as the header could be freed then.

Let's fix it by making oldhdr to a local struct instead of a pointer.

[0]:
[root@fedora net]# ./lwt_dst_cache_ref_loop.sh
...
TEST: rpl (input)
[   57.631529] ==================================================================
BUG: KASAN: slab-use-after-free in rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174)
Read of size 40 at addr ffff888122bf96d8 by task ping6/1543

CPU: 50 UID: 0 PID: 1543 Comm: ping6 Not tainted 6.16.0-rc5-01302-gfadd1e6231b1 torvalds#23 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl (lib/dump_stack.c:122)
 print_report (mm/kasan/report.c:409 mm/kasan/report.c:521)
 kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:636)
 kasan_check_range (mm/kasan/generic.c:175 (discriminator 1) mm/kasan/generic.c:189 (discriminator 1))
 __asan_memmove (mm/kasan/shadow.c:94 (discriminator 2))
 rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174)
 rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282)
 lwtunnel_input (net/core/lwtunnel.c:459)
 ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1))
 __netif_receive_skb_one_core (net/core/dev.c:5967)
 process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440)
 __napi_poll.constprop.0 (net/core/dev.c:7452)
 net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643)
 handle_softirqs (kernel/softirq.c:579)
 do_softirq (kernel/softirq.c:480 (discriminator 20))
 </IRQ>
 <TASK>
 __local_bh_enable_ip (kernel/softirq.c:407)
 __dev_queue_xmit (net/core/dev.c:4740)
 ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141)
 ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226)
 ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248)
 ip6_send_skb (net/ipv6/ip6_output.c:1983)
 rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f68cffb2a06
Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
RSP: 002b:00007ffefb7c53d0 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000564cd69f10a0 RCX: 00007f68cffb2a06
RDX: 0000000000000040 RSI: 0000564cd69f10a4 RDI: 0000000000000003
RBP: 00007ffefb7c53f0 R08: 0000564cd6a032ac R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000202 R12: 0000564cd69f10a4
R13: 0000000000000040 R14: 00007ffefb7c66e0 R15: 0000564cd69f10a0
 </TASK>

Allocated by task 1543:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 __kasan_slab_alloc (mm/kasan/common.c:319 mm/kasan/common.c:345)
 kmem_cache_alloc_node_noprof (./include/linux/kasan.h:250 mm/slub.c:4148 mm/slub.c:4197 mm/slub.c:4249)
 kmalloc_reserve (net/core/skbuff.c:581 (discriminator 88))
 __alloc_skb (net/core/skbuff.c:669)
 __ip6_append_data (net/ipv6/ip6_output.c:1672 (discriminator 1))
 ip6_append_data (net/ipv6/ip6_output.c:1859)
 rawv6_sendmsg (net/ipv6/raw.c:911)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Freed by task 1543:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 kasan_save_free_info (mm/kasan/generic.c:579 (discriminator 1))
 __kasan_slab_free (mm/kasan/common.c:271)
 kmem_cache_free (mm/slub.c:4643 (discriminator 3) mm/slub.c:4745 (discriminator 3))
 pskb_expand_head (net/core/skbuff.c:2274)
 rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:158 (discriminator 1))
 rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282)
 lwtunnel_input (net/core/lwtunnel.c:459)
 ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1))
 __netif_receive_skb_one_core (net/core/dev.c:5967)
 process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440)
 __napi_poll.constprop.0 (net/core/dev.c:7452)
 net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643)
 handle_softirqs (kernel/softirq.c:579)
 do_softirq (kernel/softirq.c:480 (discriminator 20))
 __local_bh_enable_ip (kernel/softirq.c:407)
 __dev_queue_xmit (net/core/dev.c:4740)
 ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141)
 ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226)
 ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248)
 ip6_send_skb (net/ipv6/ip6_output.c:1983)
 rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

The buggy address belongs to the object at ffff888122bf96c0
 which belongs to the cache skbuff_small_head of size 704
The buggy address is located 24 bytes inside of
 freed 704-byte region [ffff888122bf96c0, ffff888122bf9980)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x122bf8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x200000000000040(head|node=0|zone=2)
page_type: f5(slab)
raw: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002
raw: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000
head: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002
head: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000
head: 0200000000000003 ffffea00048afe01 00000000ffffffff 00000000ffffffff
head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888122bf9580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888122bf9600: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888122bf9680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
                                                    ^
 ffff888122bf9700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888122bf9780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: a7a29f9 ("net: ipv6: add rpl sr tunnel")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Jul 31, 2025
Using an uninitialized mutex leads to the following warning when
CONFIG_DEBUG_MUTEXES is enabled.

 DEBUG_LOCKS_WARN_ON(lock->magic != lock)
 WARNING: CPU: 2 PID: 49 at kernel/locking/mutex.c:577 __mutex_lock+0x690/0x820
 Modules linked in:
 CPU: 2 UID: 0 PID: 49 Comm: kworker/u16:2 Not tainted 6.16.0-04405-g4b290aae788e torvalds#23 PREEMPT
 Hardware name: Freescale i.MX8QM MEK (DT)
 Workqueue: events_unbound deferred_probe_work_func
 pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : __mutex_lock+0x690/0x820
 lr : __mutex_lock+0x690/0x820
 sp : ffff8000830b3900
 x29: ffff8000830b3960 x28: ffff80008223bac0 x27: 0000000000000000
 x26: ffff000802fc4680 x25: ffff000800019a0d x24: 0000000000000000
 x23: ffff8000806f0d6c x22: 0000000000000002 x21: 0000000000000000
 x20: ffff8000830b3930 x19: ffff000802338090 x18: fffffffffffe6138
 x17: 67657220796d6d75 x16: 6420676e69737520 x15: 0000000000000038
 x14: 0000000000000000 x13: ffff8000822636f0 x12: 000000000000044a
 x11: 000000000000016e x10: ffff8000822bd940 x9 : ffff8000822636f0
 x8 : 00000000ffffefff x7 : ffff8000822bb6f0 x6 : 000000000000016e
 x5 : 000000000000016f x4 : 40000000fffff16e x3 : 0000000000000000
 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0008003b8000
 Call trace:
  __mutex_lock+0x690/0x820 (P)
  mutex_lock_nested+0x24/0x30
  imx_hsio_power_on+0x4c/0x764
  phy_power_on+0x7c/0x12c
  imx_sata_enable+0x1ec/0x488
  imx_ahci_probe+0x1a4/0x560
  platform_probe+0x5c/0x98
  really_probe+0xac/0x298
  __driver_probe_device+0x78/0x12c
  driver_probe_device+0xd8/0x160
  __device_attach_driver+0xb8/0x138
  bus_for_each_drv+0x88/0xe8
  __device_attach+0xa0/0x190
  device_initial_probe+0x14/0x20
  bus_probe_device+0xac/0xb0
  deferred_probe_work_func+0x8c/0xc8
  process_one_work+0x1e4/0x440
  worker_thread+0x1b4/0x350
  kthread+0x138/0x214
  ret_from_fork+0x10/0x20

Fixes: 82c56b6 ("phy: freescale: imx8qm-hsio: Add i.MX8QM HSIO PHY driver support")
Signed-off-by: Quanyang Wang <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Aug 3, 2025
Using an uninitialized mutex leads to the following warning when
CONFIG_DEBUG_MUTEXES is enabled.

 DEBUG_LOCKS_WARN_ON(lock->magic != lock)
 WARNING: CPU: 2 PID: 49 at kernel/locking/mutex.c:577 __mutex_lock+0x690/0x820
 Modules linked in:
 CPU: 2 UID: 0 PID: 49 Comm: kworker/u16:2 Not tainted 6.16.0-04405-g4b290aae788e torvalds#23 PREEMPT
 Hardware name: Freescale i.MX8QM MEK (DT)
 Workqueue: events_unbound deferred_probe_work_func
 pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : __mutex_lock+0x690/0x820
 lr : __mutex_lock+0x690/0x820
 sp : ffff8000830b3900
 x29: ffff8000830b3960 x28: ffff80008223bac0 x27: 0000000000000000
 x26: ffff000802fc4680 x25: ffff000800019a0d x24: 0000000000000000
 x23: ffff8000806f0d6c x22: 0000000000000002 x21: 0000000000000000
 x20: ffff8000830b3930 x19: ffff000802338090 x18: fffffffffffe6138
 x17: 67657220796d6d75 x16: 6420676e69737520 x15: 0000000000000038
 x14: 0000000000000000 x13: ffff8000822636f0 x12: 000000000000044a
 x11: 000000000000016e x10: ffff8000822bd940 x9 : ffff8000822636f0
 x8 : 00000000ffffefff x7 : ffff8000822bb6f0 x6 : 000000000000016e
 x5 : 000000000000016f x4 : 40000000fffff16e x3 : 0000000000000000
 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0008003b8000
 Call trace:
  __mutex_lock+0x690/0x820 (P)
  mutex_lock_nested+0x24/0x30
  imx_hsio_power_on+0x4c/0x764
  phy_power_on+0x7c/0x12c
  imx_sata_enable+0x1ec/0x488
  imx_ahci_probe+0x1a4/0x560
  platform_probe+0x5c/0x98
  really_probe+0xac/0x298
  __driver_probe_device+0x78/0x12c
  driver_probe_device+0xd8/0x160
  __device_attach_driver+0xb8/0x138
  bus_for_each_drv+0x88/0xe8
  __device_attach+0xa0/0x190
  device_initial_probe+0x14/0x20
  bus_probe_device+0xac/0xb0
  deferred_probe_work_func+0x8c/0xc8
  process_one_work+0x1e4/0x440
  worker_thread+0x1b4/0x350
  kthread+0x138/0x214
  ret_from_fork+0x10/0x20

Fixes: 82c56b6 ("phy: freescale: imx8qm-hsio: Add i.MX8QM HSIO PHY driver support")
Signed-off-by: Quanyang Wang <[email protected]>
1054009064 pushed a commit to 1054009064/linux that referenced this pull request Aug 28, 2025
[ Upstream commit b640daa ]

Running lwt_dst_cache_ref_loop.sh in selftest with KASAN triggers
the splat below [0].

rpl_do_srh_inline() fetches ipv6_hdr(skb) and accesses it after
skb_cow_head(), which is illegal as the header could be freed then.

Let's fix it by making oldhdr to a local struct instead of a pointer.

[0]:
[root@fedora net]# ./lwt_dst_cache_ref_loop.sh
...
TEST: rpl (input)
[   57.631529] ==================================================================
BUG: KASAN: slab-use-after-free in rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174)
Read of size 40 at addr ffff888122bf96d8 by task ping6/1543

CPU: 50 UID: 0 PID: 1543 Comm: ping6 Not tainted 6.16.0-rc5-01302-gfadd1e6231b1 torvalds#23 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl (lib/dump_stack.c:122)
 print_report (mm/kasan/report.c:409 mm/kasan/report.c:521)
 kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:636)
 kasan_check_range (mm/kasan/generic.c:175 (discriminator 1) mm/kasan/generic.c:189 (discriminator 1))
 __asan_memmove (mm/kasan/shadow.c:94 (discriminator 2))
 rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174)
 rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282)
 lwtunnel_input (net/core/lwtunnel.c:459)
 ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1))
 __netif_receive_skb_one_core (net/core/dev.c:5967)
 process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440)
 __napi_poll.constprop.0 (net/core/dev.c:7452)
 net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643)
 handle_softirqs (kernel/softirq.c:579)
 do_softirq (kernel/softirq.c:480 (discriminator 20))
 </IRQ>
 <TASK>
 __local_bh_enable_ip (kernel/softirq.c:407)
 __dev_queue_xmit (net/core/dev.c:4740)
 ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141)
 ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226)
 ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248)
 ip6_send_skb (net/ipv6/ip6_output.c:1983)
 rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f68cffb2a06
Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
RSP: 002b:00007ffefb7c53d0 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000564cd69f10a0 RCX: 00007f68cffb2a06
RDX: 0000000000000040 RSI: 0000564cd69f10a4 RDI: 0000000000000003
RBP: 00007ffefb7c53f0 R08: 0000564cd6a032ac R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000202 R12: 0000564cd69f10a4
R13: 0000000000000040 R14: 00007ffefb7c66e0 R15: 0000564cd69f10a0
 </TASK>

Allocated by task 1543:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 __kasan_slab_alloc (mm/kasan/common.c:319 mm/kasan/common.c:345)
 kmem_cache_alloc_node_noprof (./include/linux/kasan.h:250 mm/slub.c:4148 mm/slub.c:4197 mm/slub.c:4249)
 kmalloc_reserve (net/core/skbuff.c:581 (discriminator 88))
 __alloc_skb (net/core/skbuff.c:669)
 __ip6_append_data (net/ipv6/ip6_output.c:1672 (discriminator 1))
 ip6_append_data (net/ipv6/ip6_output.c:1859)
 rawv6_sendmsg (net/ipv6/raw.c:911)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Freed by task 1543:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 kasan_save_free_info (mm/kasan/generic.c:579 (discriminator 1))
 __kasan_slab_free (mm/kasan/common.c:271)
 kmem_cache_free (mm/slub.c:4643 (discriminator 3) mm/slub.c:4745 (discriminator 3))
 pskb_expand_head (net/core/skbuff.c:2274)
 rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:158 (discriminator 1))
 rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282)
 lwtunnel_input (net/core/lwtunnel.c:459)
 ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1))
 __netif_receive_skb_one_core (net/core/dev.c:5967)
 process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440)
 __napi_poll.constprop.0 (net/core/dev.c:7452)
 net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643)
 handle_softirqs (kernel/softirq.c:579)
 do_softirq (kernel/softirq.c:480 (discriminator 20))
 __local_bh_enable_ip (kernel/softirq.c:407)
 __dev_queue_xmit (net/core/dev.c:4740)
 ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141)
 ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226)
 ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248)
 ip6_send_skb (net/ipv6/ip6_output.c:1983)
 rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

The buggy address belongs to the object at ffff888122bf96c0
 which belongs to the cache skbuff_small_head of size 704
The buggy address is located 24 bytes inside of
 freed 704-byte region [ffff888122bf96c0, ffff888122bf9980)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x122bf8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x200000000000040(head|node=0|zone=2)
page_type: f5(slab)
raw: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002
raw: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000
head: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002
head: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000
head: 0200000000000003 ffffea00048afe01 00000000ffffffff 00000000ffffffff
head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888122bf9580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888122bf9600: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888122bf9680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
                                                    ^
 ffff888122bf9700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888122bf9780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: a7a29f9 ("net: ipv6: add rpl sr tunnel")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
1054009064 pushed a commit to 1054009064/linux that referenced this pull request Aug 28, 2025
[ Upstream commit b640daa ]

Running lwt_dst_cache_ref_loop.sh in selftest with KASAN triggers
the splat below [0].

rpl_do_srh_inline() fetches ipv6_hdr(skb) and accesses it after
skb_cow_head(), which is illegal as the header could be freed then.

Let's fix it by making oldhdr to a local struct instead of a pointer.

[0]:
[root@fedora net]# ./lwt_dst_cache_ref_loop.sh
...
TEST: rpl (input)
[   57.631529] ==================================================================
BUG: KASAN: slab-use-after-free in rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174)
Read of size 40 at addr ffff888122bf96d8 by task ping6/1543

CPU: 50 UID: 0 PID: 1543 Comm: ping6 Not tainted 6.16.0-rc5-01302-gfadd1e6231b1 torvalds#23 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl (lib/dump_stack.c:122)
 print_report (mm/kasan/report.c:409 mm/kasan/report.c:521)
 kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:636)
 kasan_check_range (mm/kasan/generic.c:175 (discriminator 1) mm/kasan/generic.c:189 (discriminator 1))
 __asan_memmove (mm/kasan/shadow.c:94 (discriminator 2))
 rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174)
 rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282)
 lwtunnel_input (net/core/lwtunnel.c:459)
 ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1))
 __netif_receive_skb_one_core (net/core/dev.c:5967)
 process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440)
 __napi_poll.constprop.0 (net/core/dev.c:7452)
 net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643)
 handle_softirqs (kernel/softirq.c:579)
 do_softirq (kernel/softirq.c:480 (discriminator 20))
 </IRQ>
 <TASK>
 __local_bh_enable_ip (kernel/softirq.c:407)
 __dev_queue_xmit (net/core/dev.c:4740)
 ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141)
 ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226)
 ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248)
 ip6_send_skb (net/ipv6/ip6_output.c:1983)
 rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f68cffb2a06
Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
RSP: 002b:00007ffefb7c53d0 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000564cd69f10a0 RCX: 00007f68cffb2a06
RDX: 0000000000000040 RSI: 0000564cd69f10a4 RDI: 0000000000000003
RBP: 00007ffefb7c53f0 R08: 0000564cd6a032ac R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000202 R12: 0000564cd69f10a4
R13: 0000000000000040 R14: 00007ffefb7c66e0 R15: 0000564cd69f10a0
 </TASK>

Allocated by task 1543:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 __kasan_slab_alloc (mm/kasan/common.c:319 mm/kasan/common.c:345)
 kmem_cache_alloc_node_noprof (./include/linux/kasan.h:250 mm/slub.c:4148 mm/slub.c:4197 mm/slub.c:4249)
 kmalloc_reserve (net/core/skbuff.c:581 (discriminator 88))
 __alloc_skb (net/core/skbuff.c:669)
 __ip6_append_data (net/ipv6/ip6_output.c:1672 (discriminator 1))
 ip6_append_data (net/ipv6/ip6_output.c:1859)
 rawv6_sendmsg (net/ipv6/raw.c:911)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Freed by task 1543:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 kasan_save_free_info (mm/kasan/generic.c:579 (discriminator 1))
 __kasan_slab_free (mm/kasan/common.c:271)
 kmem_cache_free (mm/slub.c:4643 (discriminator 3) mm/slub.c:4745 (discriminator 3))
 pskb_expand_head (net/core/skbuff.c:2274)
 rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:158 (discriminator 1))
 rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282)
 lwtunnel_input (net/core/lwtunnel.c:459)
 ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1))
 __netif_receive_skb_one_core (net/core/dev.c:5967)
 process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440)
 __napi_poll.constprop.0 (net/core/dev.c:7452)
 net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643)
 handle_softirqs (kernel/softirq.c:579)
 do_softirq (kernel/softirq.c:480 (discriminator 20))
 __local_bh_enable_ip (kernel/softirq.c:407)
 __dev_queue_xmit (net/core/dev.c:4740)
 ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141)
 ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226)
 ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248)
 ip6_send_skb (net/ipv6/ip6_output.c:1983)
 rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918)
 __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1))
 __x64_sys_sendto (net/socket.c:2231)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

The buggy address belongs to the object at ffff888122bf96c0
 which belongs to the cache skbuff_small_head of size 704
The buggy address is located 24 bytes inside of
 freed 704-byte region [ffff888122bf96c0, ffff888122bf9980)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x122bf8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x200000000000040(head|node=0|zone=2)
page_type: f5(slab)
raw: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002
raw: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000
head: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002
head: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000
head: 0200000000000003 ffffea00048afe01 00000000ffffffff 00000000ffffffff
head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888122bf9580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888122bf9600: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888122bf9680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
                                                    ^
 ffff888122bf9700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888122bf9780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: a7a29f9 ("net: ipv6: add rpl sr tunnel")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
quasafox pushed a commit to quasafox/linux that referenced this pull request Aug 30, 2025
Fix torvalds#23

WARNING: CPU: 2 PID: 26 at kernel/sched/alt_core.c:6294
sched_cpu_dying.cold+0xc/0xd2
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 2, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 4, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
ioworker0 pushed a commit to ioworker0/linux that referenced this pull request Sep 4, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
ioworker0 pushed a commit to ioworker0/linux that referenced this pull request Sep 5, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
hbiyik pushed a commit to hbiyik/linux that referenced this pull request Sep 8, 2025
sizeof(unsigned long) * 8 is the number of bits in an unsigned long
variable, replace it with BITS_PER_LONG macro to make them simpler.

And fix the warning:
	WARNING: Comparisons should place the constant on the right side of the test
	torvalds#23: FILE: drivers/gpu/drm/panthor/panthor_mmu.c:2696:
	+       if (BITS_PER_LONG < va_bits) {

Signed-off-by: Jinjie Ruan <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
hbiyik pushed a commit to hbiyik/linux that referenced this pull request Sep 8, 2025
sizeof(unsigned long) * 8 is the number of bits in an unsigned long
variable, replace it with BITS_PER_LONG macro to make them simpler.

And fix the warning:
	WARNING: Comparisons should place the constant on the right side of the test
	torvalds#23: FILE: drivers/gpu/drm/panthor/panthor_mmu.c:2696:
	+       if (BITS_PER_LONG < va_bits) {

Signed-off-by: Jinjie Ruan <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 9, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 10, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
ioworker0 pushed a commit to ioworker0/linux that referenced this pull request Sep 11, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
ioworker0 pushed a commit to ioworker0/linux that referenced this pull request Sep 12, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 14, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 15, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 16, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 17, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 18, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 19, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 20, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
ioworker0 pushed a commit to ioworker0/linux that referenced this pull request Sep 21, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: SeongJae Park <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Alex Dubov <[email protected]>
Cc: Alex Willamson <[email protected]>
Cc: Bart van Assche <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Brett Creeley <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: Damien Le Maol <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Lahtinen <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Maxim Levitky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Niklas Cassel <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Robin Murohy <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Shameerali Kolothum Thodi <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yishai Hadas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 21, 2025
When knfsd is a reexport nfs server, it nlm4svc_proc_test() in
calling nlmsvc_testlock() with a lock conflict lock_release_private()
call would end up calling nlmsvc_put_lockowner() and then back in
nlm4svc_proc_test() function there will be another call to
nlmsvc_put_lockowner() for the same owner leading to use-after-free
violation (below).

The problem only arises when the underlying file system has been
re-exported as different paths are taken in vfs_test_lock().
When it's reexport, filp->f_op->lock is set and when vfs_test_lock()
is done fl_lmops pointer is non-null. When it's regular export,
vfs_test_lock() calls posix_test_lock() which ends up calling
locks_copy_conflock() and it copies NULL into fl_lmops and then
calling into lock_release_private() does not call
nlmsvc_put_lockowner().

The proposed solution is to intentionally clear fl_lmops pointer to
make sure that if there is a conflict (be it a local file system
or reexport), lock_release_private() would not call
nlmsvc_put_lockowner().

kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nlmsvc_put_lockowner+0x30/0x250 [lockd]
kernel: Read of size 4 at addr ffff0000bf3bca10 by task lockd/6092
kernel:
kernel: CPU: 0 UID: 0 PID: 6092 Comm: lockd Kdump: loaded Not tainted 6.18.0-rc1+ torvalds#23 PREEMPT(voluntary)
kernel: Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24006586.BA64.2406042154 06/04/2024
kernel: Call trace:
kernel:  show_stack+0x34/0x98 (C)
kernel:  dump_stack_lvl+0x80/0xa8
kernel:  print_address_description.constprop.0+0x90/0x310
kernel:  print_report+0x108/0x1f8
kernel:  kasan_report+0xc8/0x120
kernel:  kasan_check_range+0xe8/0x190
kernel:  __kasan_check_read+0x20/0x30
kernel:  nlmsvc_put_lockowner+0x30/0x250 [lockd]
kernel:  __nlm4svc_proc_test+0x2a8/0x318 [lockd]
kernel:  nlm4svc_proc_test+0x44/0xb78 [lockd]
kernel:  nlmsvc_dispatch+0xb0/0x200 [lockd]
kernel:  svc_process_common+0xb20/0x17b0 [sunrpc]
kernel:  svc_process+0x414/0x900 [sunrpc]
kernel:  svc_handle_xprt+0x5c8/0xd60 [sunrpc]
kernel:  svc_recv+0x1a4/0x520 [sunrpc]
kernel:  lockd+0x154/0x298 [lockd]
kernel:  kthread+0x2f8/0x398
kernel:  ret_from_fork+0x10/0x20
kernel:
kernel: Allocated by task 6092:
kernel:  kasan_save_stack+0x3c/0x70
kernel:  kasan_save_track+0x20/0x40
kernel:  kasan_save_alloc_info+0x40/0x58
kernel:  __kasan_kmalloc+0xd4/0xd8
kernel:  __kmalloc_cache_noprof+0x1a8/0x5c0
kernel:  nlmsvc_locks_init_private+0xe4/0x520 [lockd]
kernel:  nlm4svc_retrieve_args+0x38c/0x530 [lockd]
kernel:  __nlm4svc_proc_test+0x194/0x318 [lockd]
kernel:  nlm4svc_proc_test+0x44/0xb78 [lockd]
kernel:  nlmsvc_dispatch+0xb0/0x200 [lockd]
kernel:  svc_process_common+0xb20/0x17b0 [sunrpc]
kernel:  svc_process+0x414/0x900 [sunrpc]
kernel:  svc_handle_xprt+0x5c8/0xd60 [sunrpc]
kernel:  svc_recv+0x1a4/0x520 [sunrpc]
kernel:  lockd+0x154/0x298 [lockd]
kernel:  kthread+0x2f8/0x398
kernel:  ret_from_fork+0x10/0x20
kernel:
kernel: Freed by task 6092:
kernel:  kasan_save_stack+0x3c/0x70
kernel:  kasan_save_track+0x20/0x40
kernel:  __kasan_save_free_info+0x4c/0x80
kernel:  __kasan_slab_free+0x88/0xc0
kernel:  kfree+0x110/0x480
kernel:  nlmsvc_put_lockowner+0x1b4/0x250 [lockd]
kernel:  nlmsvc_put_owner+0x18/0x30 [lockd]
kernel:  locks_release_private+0x190/0x2a8
kernel:  nlmsvc_testlock+0x2e0/0x648 [lockd]
kernel:  __nlm4svc_proc_test+0x244/0x318 [lockd]
kernel:  nlm4svc_proc_test+0x44/0xb78 [lockd]
kernel:  nlmsvc_dispatch+0xb0/0x200 [lockd]
kernel:  svc_process_common+0xb20/0x17b0 [sunrpc]
kernel:  svc_process+0x414/0x900 [sunrpc]
kernel:  svc_handle_xprt+0x5c8/0xd60 [sunrpc]
kernel:  svc_recv+0x1a4/0x520 [sunrpc]
kernel:  lockd+0x154/0x298 [lockd]
kernel:  kthread+0x2f8/0x398
kernel:  ret_from_fork+0x10/0x20
kernel:
kernel: The buggy address belongs to the object at ffff0000bf3bca00
        which belongs to the cache kmalloc-64 of size 64
kernel: The buggy address is located 16 bytes inside of
        freed 64-byte region [ffff0000bf3bca00, ffff0000bf3bca40)
kernel:
kernel: The buggy address belongs to the physical page:
kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13f3bc
kernel: flags: 0x2fffff00000000(node=0|zone=2|lastcpupid=0xfffff)
kernel: page_type: f5(slab)
kernel: raw: 002fffff00000000 ffff0000800028c0 dead000000000122 0000000000000000
kernel: raw: 0000000000000000 0000000080200020 00000000f5000000 0000000000000000
kernel: page dumped because: kasan: bad access detected
kernel:
kernel: Memory state around the buggy address:
kernel:  ffff0000bf3bc900: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
kernel:  ffff0000bf3bc980: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
kernel: >ffff0000bf3bca00: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
kernel:                          ^
kernel:  ffff0000bf3bca80: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
kernel:  ffff0000bf3bcb00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
kernel: ==================================================================
kernel: Disabling lock debugging due to kernel taint
kernel: AGLO: nlmsvc_put_lockowner 00000000028055fb count=0
kernel: CPU: 0 UID: 0 PID: 6092 Comm: lockd Kdump: loaded Tainted: G    B               6.18.0-rc1+ torvalds#23 PREEMPT(voluntary)
kernel: Tainted: [B]=BAD_PAGE
kernel: Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24006586.BA64.2406042154 06/04/2024
kernel: Call trace:
kernel:  show_stack+0x34/0x98 (C)
kernel:  dump_stack_lvl+0x80/0xa8
kernel:  dump_stack+0x1c/0x30
kernel:  nlmsvc_put_lockowner+0x7c/0x250 [lockd]
kernel:  __nlm4svc_proc_test+0x2a8/0x318 [lockd]
kernel:  nlm4svc_proc_test+0x44/0xb78 [lockd]
kernel:  nlmsvc_dispatch+0xb0/0x200 [lockd]
kernel:  svc_process_common+0xb20/0x17b0 [sunrpc]
kernel:  svc_process+0x414/0x900 [sunrpc]
kernel:  svc_handle_xprt+0x5c8/0xd60 [sunrpc]
kernel:  svc_recv+0x1a4/0x520 [sunrpc]
kernel:  lockd+0x154/0x298 [lockd]
kernel:  kthread+0x2f8/0x398
kernel:  ret_from_fork+0x10/0x20
kernel: ------------[ cut here ]------------
kernel: refcount_t: underflow; use-after-free.
kernel: WARNING: CPU: 0 PID: 6092 at lib/refcount.c:87 refcount_dec_not_one+0x198/0x1b0
kernel: Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd nfsv3 nfs_acl nfs lockd grace nfs_localio netfs ext4 crc16 mbcache jbd2 overlay uinput snd_seq_dummy snd_hrtimer qrtr rfkill vfat fat uvcvideo snd_hda_codec_generic videobuf2_vmalloc videobuf2_memops uvc snd_hda_intel videobuf2_v4l2 videobuf2_common snd_intel_dspcfg videodev snd_hda_codec snd_hda_core mc snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore sg loop auth_rpcgss vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs 8021q garp stp llc mrp nvme nvme_core nvme_keyring nvme_auth ghash_ce hkdf e1000e sr_mod cdrom vmwgfx drm_ttm_helper ttm sunrpc dm_mirror dm_region_hash dm_log iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse dm_multipath dm_mod nfnetlink
kernel: CPU: 0 UID: 0 PID: 6092 Comm: lockd Kdump: loaded Tainted: G    B               6.18.0-rc1+ torvalds#23 PREEMPT(voluntary)
kernel: Tainted: [B]=BAD_PAGE
kernel: Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24006586.BA64.2406042154 06/04/2024
kernel: pstate: 6140000 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
kernel: pc : refcount_dec_not_one+0x198/0x1b0
kernel: lr : refcount_dec_not_one+0x198/0x1b0
kernel: sp : ffff80008a627930
kernel: x29: ffff80008a627990 x28: ffff0000bf3bca00 x27: ffff0000ba5c7000
kernel: x26: 1fffe000191eeb84 x25: 1ffff000114c4f48 x24: ffff0000c8f75c24
kernel: x23: 0000000000000007 x22: ffff80008a627950 x21: 1ffff000114c4f26
kernel: x20: 00000000ffffffff x19: ffff0000bf3bca10 x18: 0000000000000310
kernel: x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
kernel: x14: 0000000000000000 x13: 0000000000000001 x12: ffff60004fd90aa3
kernel: x11: 1fffe0004fd90aa2 x10: ffff60004fd90aa2 x9 : dfff800000000000
kernel: x8 : 00009fffb026f55e x7 : ffff00027ec85513 x6 : 0000000000000001
kernel: x5 : ffff00027ec85510 x4 : ffff60004fd90aa3 x3 : ffff800080787bc0
kernel: x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000a75a8000
kernel: Call trace:
kernel:  refcount_dec_not_one+0x198/0x1b0 (P)
kernel:  refcount_dec_and_lock+0x1c/0xb8
kernel:  nlmsvc_put_lockowner+0x9c/0x250 [lockd]
kernel:  __nlm4svc_proc_test+0x2a8/0x318 [lockd]
kernel:  nlm4svc_proc_test+0x44/0xb78 [lockd]
kernel:  nlmsvc_dispatch+0xb0/0x200 [lockd]
kernel:  svc_process_common+0xb20/0x17b0 [sunrpc]
kernel:  svc_process+0x414/0x900 [sunrpc]
kernel:  svc_handle_xprt+0x5c8/0xd60 [sunrpc]
kernel:  svc_recv+0x1a4/0x520 [sunrpc]
kernel:  lockd+0x154/0x298 [lockd]
kernel:  kthread+0x2f8/0x398
kernel:  ret_from_fork+0x10/0x20
kernel: ---[ end trace 0000000000000000 ]---

Signed-off-by: Olga Kornievskaia <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants