tst_huge hangs with memory over 1GB. #1049

@nyh

Description

This issue was discovered by @wkozaczuk and reported on the mailing list. It is likely a recent regression, or an old bug which, for an unknown reason, got exposed by recent changes to the memory layout.

Running tst-huge.so with 1GB of memory works fine:

scripts/build image=tests
scripts/run.py -m1G -e tests/tst-huge.so

However, giving it more memory, like the default 2GB, results in a hang.
With gdb we can see the test thread hung in a deadlock:

(gdb) connect
(gdb) osv syms
(gdb) osv info threads
...
 253 (0xffff80000255b040) tests/tst-huge. cpu1 status::waiting waitqueue::wait(lockfree::mutex&) at core/waitqueue.cc:56 vruntime  2.75268e-19
(gdb) osv thread
(gdb) bt
#0  sched::thread::switch_to (this=0xffff800000a6a040, 
    this@entry=0xffff800000a81040) at arch/x64/arch-switch.hh:108
#1  0x00000000403f31c4 in sched::cpu::reschedule_from_interrupt (
    this=0xffff800000053040, called_from_yield=called_from_yield@entry=false, 
    preempt_after=..., preempt_after@entry=...) at core/sched.cc:339
#2  0x00000000403f36ac in sched::cpu::schedule () at include/osv/sched.hh:1310
#3  0x00000000403f3d72 in sched::thread::wait (
    this=this@entry=0xffff80000255b040) at core/sched.cc:1214
#4  0x000000004041b0e2 in sched::thread::do_wait_for<lockfree::mutex, sched::wait_object<waitqueue> > (mtx=...) at include/osv/mutex.h:41
#5  sched::thread::wait_for<waitqueue&> (mtx=...) at include/osv/sched.hh:1220
#6  waitqueue::wait (this=this@entry=0x409a3550 <mmu::vma_list_mutex+48>, 
    mtx=...) at core/waitqueue.cc:56
#7  0x00000000403d6ea3 in rwlock::reader_wait_lockable (this=<optimized out>)
    at core/rwlock.cc:174
#8  rwlock::rlock (this=this@entry=0x409a3520 <mmu::vma_list_mutex>)
    at core/rwlock.cc:29
#9  0x0000000040341e0f in rwlock_for_read::lock (
    this=0x409a3520 <mmu::vma_list_mutex>) at include/osv/rwlock.h:113
#10 std::lock_guard<rwlock_for_read&>::lock_guard (__m=..., 
    this=<synthetic pointer>) at /usr/include/c++/8/bits/std_mutex.h:162
#11 lock_guard_for_with_lock<rwlock_for_read&>::lock_guard_for_with_lock (
    lock=..., this=<synthetic pointer>) at include/osv/mutex.h:89
#12 mmu::vm_fault (addr=18446603337326391296, addr@entry=18446603337326395384, 
    ef=ef@entry=0xffff800002560068) at core/mmu.cc:1334
#13 0x000000004039cdf6 in page_fault (ef=0xffff800002560068)
    at arch/x64/mmu.cc:38
#14 <signal handler called>
#15 0x00000000403e5b64 in memory::page_range_allocator::insert<true> (
    this=this@entry=0x409bb300 <memory::free_page_ranges>, pr=...)
    at core/mempool.cc:575
#16 0x00000000403e337c in memory::page_range_allocator::<lambda(memory::page_range&)>::operator() (header=..., __closure=<synthetic pointer>)
    at core/mempool.cc:751
#17 memory::page_range_allocator::<lambda(memory::page_range&)>::operator() (
    header=..., __closure=<synthetic pointer>) at core/mempool.cc:736
#18 memory::page_range_allocator::for_each<memory::page_range_allocator::alloc_aligned(size_t, size_t, size_t, bool)::<lambda(memory::page_range&)> > (f=..., 
    min_order=<optimized out>, this=0x409bb300 <memory::free_page_ranges>)
    at core/mempool.cc:809
#19 memory::page_range_allocator::alloc_aligned (
    this=this@entry=0x409bb300 <memory::free_page_ranges>, 
    size=size@entry=2097152, offset=offset@entry=0, 
    alignment=alignment@entry=2097152, fill=fill@entry=true)
    at core/mempool.cc:736
#20 0x00000000403e3c90 in memory::alloc_huge_page (N=N@entry=2097152)
    at core/mempool.cc:1601
#21 0x000000004034718e in mmu::uninitialized_anonymous_page_provider::map (
    this=0x408681b0 <mmu::page_allocator_init>, offset=83886080, ptep=..., 
    pte=..., write=<optimized out>) at include/osv/mmu-defs.hh:219
#22 0x000000004034c79a in mmu::populate<(mmu::account_opt)1>::page<1> (
    offset=83886080, ptep=..., this=0x2000001ffd80)
    at include/osv/mmu-defs.hh:235
#23 mmu::page<mmu::populate<>, 1> (ptep=..., offset=83886080, pops=...)
    at core/mmu.cc:311
#24 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 2>::operator() (
    base_virt=35185397596160, parent=..., this=<synthetic pointer>)
    at core/mmu.cc:437
#25 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 3>::map_range<2> (
    this=<synthetic pointer>, ptep=..., base_virt=35184372088832, slop=4096, 
    page_mapper=..., size=132120576, vcur=<optimized out>) at core/mmu.cc:399
#26 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 3>::operator() (
    base_virt=35184372088832, parent=..., this=<synthetic pointer>)
    at core/mmu.cc:449
#27 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 4>::map_range<3> (
    this=<synthetic pointer>, ptep=..., base_virt=35184372088832, slop=4096, 
    page_mapper=..., size=134217728, vcur=<optimized out>) at core/mmu.cc:399
#28 mmu::map_level<mmu::populate<(mmu::account_opt)1>, 4>::operator() (
    base_virt=35184372088832, parent=..., this=<synthetic pointer>)
    at core/mmu.cc:449
#29 mmu::map_range<mmu::populate<(mmu::account_opt)1> > (
    vma_start=vma_start@entry=35185313710080, 
    vstart=vstart@entry=35185313710080, size=<optimized out>, page_mapper=..., 
    slop=slop@entry=4096) at core/mmu.cc:354
#30 0x000000004034cfc4 in mmu::operate_range<mmu::populate<(mmu::account_opt)1> > (size=<optimized out>, start=0x200038200000, vma_start=<optimized out>, 
    mapper=...) at core/mmu.cc:801
#31 mmu::vma::operate_range<mmu::populate<(mmu::account_opt)1> > (size=3, 
    addr=0x200038200000, mapper=..., this=0xffffa000481ff300)
    at core/mmu.cc:1412
#32 mmu::populate_vma<(mmu::account_opt)1> (vma=vma@entry=0xffffa000481ff300, 
    v=v@entry=0x200038200000, size=size@entry=134217728, 
    write=write@entry=false) at core/mmu.cc:1206
#33 0x0000000040345809 in mmu::map_anon (addr=addr@entry=0x0, 
    size=size@entry=134217728, flags=flags@entry=2, perm=perm@entry=3)
    at core/mmu.cc:1222
#34 0x0000000040456dad in mmap (addr=addr@entry=0x0, 
    length=length@entry=134217728, prot=prot@entry=3, flags=flags@entry=32802, 
    fd=fd@entry=-1, offset=offset@entry=0) at libc/mman.cc:156
#35 0x00001000000685d4 in exhaust_memory (size=size@entry=134217728)
    at /home/nyh/osv.tmp2/osv/tests/tst-huge.cc:31
#36 0x000010000006820b in main (argc=<optimized out>, argv=<optimized out>)

The deadlock happens because map_anon (frame 33) takes vma_list_mutex.for_write(), and then we get a page fault, whose handler page_fault (frame 13) takes the same rwlock for reading. Apparently our rwlock implementation isn't recursive - maybe we should reconsider that, but it isn't the real bug here. The real bug is why we got this page fault in the first place.
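
To illustrate the shape of the problem (a minimal sketch using std::shared_mutex as a stand-in - this is not OSv's actual rwlock code): a non-recursive reader-writer lock blocks forever when the thread that already holds it for writing tries to take it again for reading:

#include <shared_mutex>

std::shared_mutex vma_list_lock;   // stand-in for mmu::vma_list_mutex

void handle_fault() {
    // Like vm_fault (frame 13): take the lock for reading. With a
    // non-recursive lock this blocks forever, because the same thread
    // already holds the lock for writing.
    std::shared_lock<std::shared_mutex> r(vma_list_lock);
}

void map_anon_like() {
    // Like map_anon (frame 33): take the lock for writing ...
    std::unique_lock<std::shared_mutex> w(vma_list_lock);
    // ... and while populating the mapping, hit a page fault whose
    // handler runs on the same thread:
    handle_fault();   // never returns
}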

The page fault happens on address 0xffff800041dffff8:

(gdb) fr 13
#13 0x000000004039cdf6 in page_fault (ef=0xffff800002560068)
    at arch/x64/mmu.cc:38
38	        mmu::vm_fault(addr, ef);
(gdb) p (void*) addr
$2 = (void *) 0xffff800041dffff8

How is this address calculated?

(gdb) fr 15
#15 0x00000000403e5b64 in memory::page_range_allocator::insert<true> (
    this=this@entry=0x409bb300 <memory::free_page_ranges>, pr=...)
    at core/mempool.cc:575
575	    void insert(page_range& pr) {
(gdb) l
570	        return size;
571	    }
572	
573	private:
574	    template<bool UseBitmap = true>
575	    void insert(page_range& pr) {
576	        auto addr = static_cast<void*>(&pr);
577	        auto pr_end = static_cast<page_range**>(addr + pr.size - sizeof(page_range**));
578	        *pr_end = &pr;
579	        auto order = ilog2(pr.size / page_size);
(gdb) p addr
$3 = (void *) 0xffff800002561000
(gdb) p pr_size
No symbol "pr_size" in current context.
(gdb) p pr.size
$4 = 1066004480
(gdb) p pr_end
$5 = (memory::page_range **) 0xffff800041dffff8

So the problematic address is pr_end. It is calculated by taking pr and adding 1066004480 to it - almost one gigabyte. Does this make sense? Do we really have such a huge contiguous (in physical memory...) allocation? We need to debug this further.
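
For what it's worth, the arithmetic does line up with the faulting address: insert() writes a back-pointer into the last 8 bytes of the page range (*pr_end = &pr), and

pr      = 0xffff800002561000
pr.size = 1066004480 = 0x3f89f000 (about 1016 MiB)
pr_end  = 0xffff800002561000 + 0x3f89f000 - 8 = 0xffff800041dffff8

which is exactly the address reported by the page fault in frame 13. So the fault presumably happens on that write to the last 8 bytes of this supposedly ~1GB free page range.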

Another interesting observation: running tst-huge.so with -m2G or even -m1.1G fails as above. But if I run it with sizes only a tiny bit over 1G, it fails in more spectacular ways:

$ scripts/run.py -m1.01G -e tests/tst-huge.so
OSv v0.53.0-88-g5377a50b
Assertion failed: ef->rflags & processor::rflags_if (arch/x64/mmu.cc: page_fault: 34)
Halting.
$ scripts/run.py -m1.02G -e tests/tst-huge.so
OSv v0.53.0-88-g5377a50b
page fault outside application, addr: 0xffff800067a13000
page fault outside application, addr: 0x0000000000000000
exception nested too deeplyHalting.

I'll open a new issue about this, because it happens before the test even runs, so it's not specific to tst-huge.so at all - although it may well turn out that both issues stem from the same bug.
