Skip to content

F2 - Infinite compaction bug #944

@TechnikEmpire

Description

@TechnikEmpire

When checking out a fresh copy from main, the following F2 tests fail (all related to RMW):

Test	Duration	Traits	Error Message
HotColdTests/HotColdParameterizedTestParam.Rmw/0 [(2048, false, false)] Failed	< 1 ms	GoogleTestAdapterTestType [Parameterized]	 !! This test has probably CRASHED !! Test output:  [0000.49112300]{ 0}::INFO::f2_test.cc:620:  TEST: Index size: 2048	Auto compaction: 0	Use Read Cache: 0 [0000.49492600]{ 0}::DEBUG::persistent_memory_malloc.h:269: Log size: 192.000 MiB [0000.49524100]{ 0}::DEBUG::persistent_memory_malloc.h:270: Mutable fraction: 0.400 [0000.49530700]{ 0}::DEBUG::persistent_memory_malloc.h:271: Pre-allocate: FALSE [0000.49535100]{ 0}::DEBUG::persistent_memory_malloc.h:592: Num mutable_pages = 2 [0000.57404200]{ 0}::DEBUG::faster.h:160: Hash Index Size: 2048 [0000.191032100]{ 0}::DEBUG::faster.h:166: Read cache is DISABLED [0000.191085300]{ 0}::DEBUG::faster.h:171: Auto compaction is DISABLED [0000.191628700]{ 0}::DEBUG::persistent_memory_malloc.h:269: Log size: 192.000 MiB [0000.191655000]{ 0}::DEBUG::persistent_memory_malloc.h:270: Mutable fraction: 0.000 [0000.191666100]{ 0}::DEBUG::persistent_memory_malloc.h:271: Pre-allocate: FALSE [0000.191672100]{ 0}::DEBUG::persistent_memory_malloc.h:592: Num mutable_pages = 2 [0000.198695200]{ 0}::DEBUG::cold_index.h:79: ColdIndex will be stored @ test_f2_store\cold_\cold-index\ [0000.199561100]{ 0}::DEBUG::faster.h:160: Hash Index Size: 2048 [0000.199574300]{ 0}::DEBUG::cold_index.h:129: ColdIndex: 2048 HT entries -> 256 MiB in-memory [~153 MiB mutable] [0000.199899800]{ 0}::DEBUG::persistent_memory_malloc.h:269: Log size: 256.000 MiB [0000.199917800]{ 0}::DEBUG::persistent_memory_malloc.h:270: Mutable fraction: 0.600 [0000.199924900]{ 0}::DEBUG::persistent_memory_malloc.h:271: Pre-allocate: FALSE [0000.199930000]{ 0}::DEBUG::persistent_memory_malloc.h:592: Num mutable_pages = 4 [0000.210031700]{ 0}::DEBUG::faster.h:160: Hash Index Size: 2048 [0000.348919600]{ 0}::DEBUG::faster.h:166: Read cache is DISABLED [0000.348975000]{ 0}::DEBUG::faster.h:171: Auto compaction is DISABLED [0000.348985700]{ 0}::DEBUG::faster.h:166: Read cache is DISABLED [0000.348990000]{ 0}::DEBUG::faster.h:171: Auto compaction is DISABLED [0000.349001900]{ 0}::INFO::f2.h:224: Automatic F2 compaction is disabled [0028.2772776324]{ 0}::DEBUG::f2.h:750: Compact HOT: {0.07 GB} {Goal 0.07 GB} [64 79606064] -> [64 79606064] 
HotColdTests/HotColdParameterizedTestParam.Rmw/1 [(2048, false, true)] Failed Stale	582 ms	GoogleTestAdapterTestType [Parameterized]	 Expected equality of these values:   Status::Ok     Which is: \0   result     Which is:
HotColdTests/HotColdParameterizedTestParam.Rmw/2 [(2048, true, false)] Failed Stale	575 ms	GoogleTestAdapterTestType [Parameterized]	 Expected equality of these values:   Status::Ok     Which is: \0   result     Which is:
HotColdTests/HotColdParameterizedTestParam.Rmw/3 [(2048, true, true)] Failed Stale	584 ms	GoogleTestAdapterTestType [Parameterized]	 Expected equality of these values:   Status::Ok     Which is: \0   result     Which is:
HotColdTests/HotColdParameterizedTestParam.Rmw/4 [(4194304, false, false)] Failed Stale	653 ms	GoogleTestAdapterTestType [Parameterized]	 Expected equality of these values:   Status::Ok     Which is: \0   result     Which is:
HotColdTests/HotColdParameterizedTestParam.Rmw/5 [(4194304, false, true)] Failed Stale	658 ms	GoogleTestAdapterTestType [Parameterized]	 Expected equality of these values:   Status::Ok     Which is: \0   result     Which is:
HotColdTests/HotColdParameterizedTestParam.Rmw/6 [(4194304, true, false)] Failed Stale	644 ms	GoogleTestAdapterTestType [Parameterized]	 Expected equality of these values:   Status::Ok     Which is: \0   result     Which is:
HotColdTests/HotColdParameterizedTestParam.Rmw/7 [(4194304, true, true)] Failed Stale	666 ms	GoogleTestAdapterTestType [Parameterized]	 Expected equality of these values:   Status::Ok     Which is: \0   result     Which is:

Test	Duration	Traits	Error Message
CompactLookupOnDiskTests/CompactLookupParameterizedOnDiskTestFixture.OnDiskRmw/0 [(true, false, 1)] Failed Stale	27.8 sec	GoogleTestAdapterTestType [Parameterized]	 Expected equality of these values:   Status::Ok     Which is: \0   result     Which is:
CompactLookupOnDiskTests/CompactLookupParameterizedOnDiskTestFixture.OnDiskRmw/1 [(true, false, 8)] Failed Stale	189 ms	GoogleTestAdapterTestType [Parameterized]	 Expected equality of these values:   Status::Ok     Which is: \0   result     Which is:
CompactLookupOnDiskTests/CompactLookupParameterizedOnDiskTestFixture.OnDiskRmw/2 [(true, true, 8)] Failed Stale	193 ms	GoogleTestAdapterTestType [Parameterized]	 Expected equality of these values:   Status::Ok     Which is: \0   result     Which is:

I modified one of the tests at the line below to try to help it pass. It seems that the purpose of the test was to prove that N bytes of records stayed in memory until more RMW ops cycled them out. The test was failing because the ops were immediately pending and not OK:

uint32_t num_records = 20000; // ~160 MB of data

I changed the value to 10,000.

Re-running the test with the lower value causes an infinite compaction loop that I have also observed in my own benchmarking of F2. I really don't know the deep inner mechanics yet but these are my observations:

  • In faster.h, pages_available is always true.
  • In LightEpoch, the calculation to determine the next safe epoch to flush becomes disconnected from the pending operations. When this happens, there's always a PENDING_GC and sometimes a REST status operation with an epoch > the safe minimum epoch that can be reclaimed.

To expand on the second point, the calculation for the safe epoch becomes stuck, never advances, and the final pending items, which appear to have their final completed status set to pending garbage collection, remain permanently out of reach of LightEpoch and cannot be completed. I suspect this is why the pages_available loop happens.

I wish I could submit a PR, but these are deep / central mechanics and I'm not sure how long it will be before I can solve this myself.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions