tst_huge hangs with memory over 1GB. #1049

@nyh

Description

This issue was discovered by @wkozaczuk and reported on the mailing list. It is likely a recent regression, or an old bug which, for an unknown reason, got exposed by recent changes to the memory layout.

Running tst-huge.so with 1GB of memory works fine:

scripts/build image=tests
scripts/run.py -m1G -e tests/tst-huge.so

However, giving it more memory, like the default 2GB, results in a hang.
With gdb we can see the test thread hung in a deadlock:

(gdb) connect
(gdb) osv syms
(gdb) osv info threads
...
 253 (0xffff80000255b040) tests/tst-huge. cpu1 status::waiting waitqueue::wait(lockfree::mutex&) at core/waitqueue.cc:56 vruntime  2.75268e-19
(gdb) osv thread
(gdb) bt
#0  sched::thread::switch_to (this=0xffff800000a6a040, 
    this@entry=0xffff800000a81040) at arch/x64/arch-switch.hh:108
#1  0x00000000403f31c4 in sched::cpu::reschedule_from_interrupt (
    this=0xffff800000053040, called_from_yield=called_from_yield@entry=false, 
    preempt_after=..., preempt_after@entry=...) at core/sched.cc:339
#2  0x00000000403f36ac in sched::cpu::schedule () at include/osv/sched.hh:1310
#3  0x00000000403f3d72 in sched::thread::wait (
    this=this@entry=0xffff80000255b040) at core/sched.cc:1214
#4  0x000000004041b0e2 in sched::thread::do_wait_for<lockfree::mutex, sched::wait_object<waitqueue> > (mtx=...) at include/osv/mutex.h:41
#5  sched::thread::wait_for<waitqueue&> (mtx=...) at include/osv/sched.hh:1220
#6  waitqueue::wait (this=this@entry=0x409a3550 <mmu::vma_list_mutex+48>, 
    mtx=...) at core/waitqueue.cc:56
#7  0x00000000403d6ea3 in rwlock::reader_wait_lockable (this=<optimized out>)
    at core/rwlock.cc:174
#8  rwlock::rlock (this=this@entry=0x409a3520 <mmu::vma_list_mutex>)
    at core/rwlock.cc:29
#9  0x0000000040341e0f in rwlock_for_read::lock (
    this=0x409a3520 <mmu::vma_list_mutex>) at include/osv/rwlock.h:113
#10 std::lock_guard<rwlock_for_read&>::lock_guard (__m=..., 
    this=<synthetic pointer>) at /usr/include/c++/8/bits/std_mutex.h:162
#11 lock_guard_for_with_lock<rwlock_for_read&>::lock_guard_for_with_lock (
    lock=..., this=<synthetic pointer>) at include/osv/mutex.h:89
#12 mmu::vm_fault (addr=18446603337326391296, addr@entry=18446603337326395384, 
    ef=ef@entry=0xffff800002560068) at core/mmu.cc:1334
#13 0x000000004039cdf6 in page_fault (ef=0xffff800002560068)
    at arch/x64/mmu.cc:38
#14 <signal handler called>
#15 0x00000000403e5b64 in memory::page_range_allocator::insert<true> (
    this=this@entry=0x409bb300 <memory::free_page_ranges>, pr=...)
    at core/mempool.cc:575
#16 0x00000000403e337c in memory::page_range_allocator::<lambda(memory::page_range&)>::operator() (header=..., __closure=<synthetic pointer>)
    at core/mempool.cc:751
#17 memory::page_range_allocator::<lambda(memory::page_range&)>::operator() (
    header=..., __closure=<synthetic pointer>) at core/mempool.cc:736
#18 memory::page_range_allocator::for_each<memory::page_range_allocator::alloc_aligned(size_t, size_t, size_t, bool)::<lambda(memory::page_range&)> > (f=..., 
    min_order=<optimized out>, this=0x409bb300 <memory::free_page_ranges>)
    at core/mempool.cc:809
#19 memory::page_range_allocator::alloc_aligned (
    this=this@entry=0x409bb300 <memory::free_page_ranges>, 
    size=size@entry=2097152, offset=offset@entry=0, 
    alignment=alignment@entry=2097152, fill=fill@entry=true)
    at core/mempool.cc:736
#20 0x00000000403e3c90 in memory::alloc_huge_page (N=N@entry=2097152)
    at core/mempool.cc:1601
#21 0x000000004034718e in mmu::uninitialized_anonymous_page_provider::map (
    this=0x408681b0 <mmu::page_allocator_init>, offset=83886080, ptep=..., 
    pte=..., write=<optimized out>) at include/osv/mmu-defs.hh:219
#22 0x000000004034c79a in mmu::populate<(mmu::account_opt)1>::page<1> (
    offset=83886080, ptep=..., this=0x2000001ffd80)
    at include/osv/mmu-defs.hh:235
#23 mmu::page<mmu::populate<>, 1> (ptep=..., offset=83886080, pops=...)
    at core/mmu.cc:311
#24 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 2>::operator() (
    base_virt=35185397596160, parent=..., this=<synthetic pointer>)
    at core/mmu.cc:437
#25 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 3>::map_range<2> (
    this=<synthetic pointer>, ptep=..., base_virt=35184372088832, slop=4096, 
    page_mapper=..., size=132120576, vcur=<optimized out>) at core/mmu.cc:399
#26 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 3>::operator() (
    base_virt=35184372088832, parent=..., this=<synthetic pointer>)
    at core/mmu.cc:449
#27 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 4>::map_range<3> (
    this=<synthetic pointer>, ptep=..., base_virt=35184372088832, slop=4096, 
    page_mapper=..., size=134217728, vcur=<optimized out>) at core/mmu.cc:399
#28 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 4>::operator() (
    base_virt=35184372088832, parent=..., this=<synthetic pointer>)
    at core/mmu.cc:449
#29 mmu::map_range<mmu::populate<(mmu::account_opt)1> > (
    vma_start=vma_start@entry=35185313710080, 
    vstart=vstart@entry=35185313710080, size=<optimized out>, page_mapper=..., 
    slop=slop@entry=4096) at core/mmu.cc:354
#30 0x000000004034cfc4 in mmu::operate_range<mmu::populate<(mmu::account_opt)1> > (size=<optimized out>, start=0x200038200000, vma_start=<optimized out>, 
    mapper=...) at core/mmu.cc:801
#31 mmu::vma::operate_range<mmu::populate<(mmu::account_opt)1> > (size=3, 
    addr=0x200038200000, mapper=..., this=0xffffa000481ff300)
    at core/mmu.cc:1412
#32 mmu::populate_vma<(mmu::account_opt)1> (vma=vma@entry=0xffffa000481ff300, 
    v=v@entry=0x200038200000, size=size@entry=134217728, 
    write=write@entry=false) at core/mmu.cc:1206
#33 0x0000000040345809 in mmu::map_anon (addr=addr@entry=0x0, 
    size=size@entry=134217728, flags=flags@entry=2, perm=perm@entry=3)
    at core/mmu.cc:1222
#34 0x0000000040456dad in mmap (addr=addr@entry=0x0, 
    length=length@entry=134217728, prot=prot@entry=3, flags=flags@entry=32802, 
    fd=fd@entry=-1, offset=offset@entry=0) at libc/mman.cc:156
#35 0x00001000000685d4 in exhaust_memory (size=size@entry=134217728)
    at /home/nyh/osv.tmp2/osv/tests/tst-huge.cc:31
#36 0x000010000006820b in main (argc=<optimized out>, argv=<optimized out>)

The deadlock happens because map_anon (frame 33) takes vma_list_mutex.for_write(), and then we get a page fault, whose handler page_fault (frame 13) takes the same rwlock for reading. Apparently our rwlock implementation isn't recursive - maybe we should reconsider that, but it isn't the real bug here. The real bug is why we got this page fault in the first place.
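
To illustrate the shape of the problem (a minimal sketch using std::shared_mutex as a stand-in - this is not OSv's actual rwlock code): a non-recursive reader-writer lock blocks forever when the thread that already holds it for writing tries to take it again for reading:

#include <shared_mutex>

std::shared_mutex vma_list_lock;   // stand-in for mmu::vma_list_mutex

void handle_fault() {
    // Like vm_fault (frame 13): take the lock for reading. With a
    // non-recursive lock this blocks forever, because the same thread
    // already holds the lock for writing.
    std::shared_lock<std::shared_mutex> r(vma_list_lock);
}

void map_anon_like() {
    // Like map_anon (frame 33): take the lock for writing ...
    std::unique_lock<std::shared_mutex> w(vma_list_lock);
    // ... and while populating the mapping, hit a page fault whose
    // handler runs on the same thread:
    handle_fault();   // never returns
}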

The page fault happens on address 0xffff800041dffff8:

(gdb) fr 13
#13 0x000000004039cdf6 in page_fault (ef=0xffff800002560068)
    at arch/x64/mmu.cc:38
38	        mmu::vm_fault(addr, ef);
(gdb) p (void*) addr
$2 = (void *) 0xffff800041dffff8

How is this address calculated?

(gdb) fr 15
#15 0x00000000403e5b64 in memory::page_range_allocator::insert<true> (
    this=this@entry=0x409bb300 <memory::free_page_ranges>, pr=...)
    at core/mempool.cc:575
575	    void insert(page_range& pr) {
(gdb) l
570	        return size;
571	    }
572	
573	private:
574	    template<bool UseBitmap = true>
575	    void insert(page_range& pr) {
576	        auto addr = static_cast<void*>(&pr);
577	        auto pr_end = static_cast<page_range**>(addr + pr.size - sizeof(page_range**));
578	        *pr_end = &pr;
579	        auto order = ilog2(pr.size / page_size);
(gdb) p addr
$3 = (void *) 0xffff800002561000
(gdb) p pr_size
No symbol "pr_size" in current context.
(gdb) p pr.size
$4 = 1066004480
(gdb) p pr_end
$5 = (memory::page_range **) 0xffff800041dffff8

So the problematic address is pr_end. It is calculated by taking pr and adding 1066004480 to it - almost one gigabyte. Does this make sense? Do we really have such a huge contiguous (in physical memory...) allocation? We need to debug this further.
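
For what it's worth, the arithmetic does line up with the faulting address: insert() writes a back-pointer into the last 8 bytes of the page range (*pr_end = &pr), and

pr      = 0xffff800002561000
pr.size = 1066004480 = 0x3f89f000 (about 1016 MiB)
pr_end  = 0xffff800002561000 + 0x3f89f000 - 8 = 0xffff800041dffff8

which is exactly the address reported by the page fault in frame 13. So the fault presumably happens on that write to the last 8 bytes of this supposedly ~1GB free page range.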

Another interesting observation: running tst-huge.so with -m2G or even -m1.1G fails as above. But if I run it with sizes only a tiny bit over 1G, it fails in more spectacular ways:

$ scripts/run.py -m1.01G -e tests/tst-huge.so
OSv v0.53.0-88-g5377a50b
Assertion failed: ef->rflags & processor::rflags_if (arch/x64/mmu.cc: page_fault: 34)
Halting.
$ scripts/run.py -m1.02G -e tests/tst-huge.so
OSv v0.53.0-88-g5377a50b
page fault outside application, addr: 0xffff800067a13000
page fault outside application, addr: 0x0000000000000000
exception nested too deeplyHalting.

I'll open a new issue about this, because it happens before the test even runs, so it's not specific to tst-huge.so at all - although it may well turn out that both issues stem from the same bug.
