-
Notifications
You must be signed in to change notification settings - Fork 581
Description
When checking out a fresh copy from main
, the following F2 tests fail (all related to RMW):
Test Duration Traits Error Message
HotColdTests/HotColdParameterizedTestParam.Rmw/0 [(2048, false, false)] Failed < 1 ms GoogleTestAdapterTestType [Parameterized] !! This test has probably CRASHED !! Test output: [0000.49112300]{ 0}::INFO::f2_test.cc:620: TEST: Index size: 2048 Auto compaction: 0 Use Read Cache: 0 [0000.49492600]{ 0}::DEBUG::persistent_memory_malloc.h:269: Log size: 192.000 MiB [0000.49524100]{ 0}::DEBUG::persistent_memory_malloc.h:270: Mutable fraction: 0.400 [0000.49530700]{ 0}::DEBUG::persistent_memory_malloc.h:271: Pre-allocate: FALSE [0000.49535100]{ 0}::DEBUG::persistent_memory_malloc.h:592: Num mutable_pages = 2 [0000.57404200]{ 0}::DEBUG::faster.h:160: Hash Index Size: 2048 [0000.191032100]{ 0}::DEBUG::faster.h:166: Read cache is DISABLED [0000.191085300]{ 0}::DEBUG::faster.h:171: Auto compaction is DISABLED [0000.191628700]{ 0}::DEBUG::persistent_memory_malloc.h:269: Log size: 192.000 MiB [0000.191655000]{ 0}::DEBUG::persistent_memory_malloc.h:270: Mutable fraction: 0.000 [0000.191666100]{ 0}::DEBUG::persistent_memory_malloc.h:271: Pre-allocate: FALSE [0000.191672100]{ 0}::DEBUG::persistent_memory_malloc.h:592: Num mutable_pages = 2 [0000.198695200]{ 0}::DEBUG::cold_index.h:79: ColdIndex will be stored @ test_f2_store\cold_\cold-index\ [0000.199561100]{ 0}::DEBUG::faster.h:160: Hash Index Size: 2048 [0000.199574300]{ 0}::DEBUG::cold_index.h:129: ColdIndex: 2048 HT entries -> 256 MiB in-memory [~153 MiB mutable] [0000.199899800]{ 0}::DEBUG::persistent_memory_malloc.h:269: Log size: 256.000 MiB [0000.199917800]{ 0}::DEBUG::persistent_memory_malloc.h:270: Mutable fraction: 0.600 [0000.199924900]{ 0}::DEBUG::persistent_memory_malloc.h:271: Pre-allocate: FALSE [0000.199930000]{ 0}::DEBUG::persistent_memory_malloc.h:592: Num mutable_pages = 4 [0000.210031700]{ 0}::DEBUG::faster.h:160: Hash Index Size: 2048 [0000.348919600]{ 0}::DEBUG::faster.h:166: Read cache is DISABLED [0000.348975000]{ 0}::DEBUG::faster.h:171: Auto compaction is DISABLED [0000.348985700]{ 0}::DEBUG::faster.h:166: Read cache is DISABLED [0000.348990000]{ 0}::DEBUG::faster.h:171: Auto compaction is DISABLED [0000.349001900]{ 0}::INFO::f2.h:224: Automatic F2 compaction is disabled [0028.2772776324]{ 0}::DEBUG::f2.h:750: Compact HOT: {0.07 GB} {Goal 0.07 GB} [64 79606064] -> [64 79606064]
HotColdTests/HotColdParameterizedTestParam.Rmw/1 [(2048, false, true)] Failed Stale 582 ms GoogleTestAdapterTestType [Parameterized] Expected equality of these values: Status::Ok Which is: \0 result Which is:
HotColdTests/HotColdParameterizedTestParam.Rmw/2 [(2048, true, false)] Failed Stale 575 ms GoogleTestAdapterTestType [Parameterized] Expected equality of these values: Status::Ok Which is: \0 result Which is:
HotColdTests/HotColdParameterizedTestParam.Rmw/3 [(2048, true, true)] Failed Stale 584 ms GoogleTestAdapterTestType [Parameterized] Expected equality of these values: Status::Ok Which is: \0 result Which is:
HotColdTests/HotColdParameterizedTestParam.Rmw/4 [(4194304, false, false)] Failed Stale 653 ms GoogleTestAdapterTestType [Parameterized] Expected equality of these values: Status::Ok Which is: \0 result Which is:
HotColdTests/HotColdParameterizedTestParam.Rmw/5 [(4194304, false, true)] Failed Stale 658 ms GoogleTestAdapterTestType [Parameterized] Expected equality of these values: Status::Ok Which is: \0 result Which is:
HotColdTests/HotColdParameterizedTestParam.Rmw/6 [(4194304, true, false)] Failed Stale 644 ms GoogleTestAdapterTestType [Parameterized] Expected equality of these values: Status::Ok Which is: \0 result Which is:
HotColdTests/HotColdParameterizedTestParam.Rmw/7 [(4194304, true, true)] Failed Stale 666 ms GoogleTestAdapterTestType [Parameterized] Expected equality of these values: Status::Ok Which is: \0 result Which is:
Test Duration Traits Error Message
CompactLookupOnDiskTests/CompactLookupParameterizedOnDiskTestFixture.OnDiskRmw/0 [(true, false, 1)] Failed Stale 27.8 sec GoogleTestAdapterTestType [Parameterized] Expected equality of these values: Status::Ok Which is: \0 result Which is:
CompactLookupOnDiskTests/CompactLookupParameterizedOnDiskTestFixture.OnDiskRmw/1 [(true, false, 8)] Failed Stale 189 ms GoogleTestAdapterTestType [Parameterized] Expected equality of these values: Status::Ok Which is: \0 result Which is:
CompactLookupOnDiskTests/CompactLookupParameterizedOnDiskTestFixture.OnDiskRmw/2 [(true, true, 8)] Failed Stale 193 ms GoogleTestAdapterTestType [Parameterized] Expected equality of these values: Status::Ok Which is: \0 result Which is:
I modified one of the tests at the line below to try to help it pass. It seems that the purpose of the test was to prove that N bytes of records stayed in memory until more RMW ops cycled them out. The test was failing because the ops were immediately pending and not OK
:
Line 640 in 7f71289
uint32_t num_records = 20000; // ~160 MB of data |
I changed the value to 10,000
.
Re-running the test with the lower value causes an infinite compaction loop that I have also observed in my own benchmarking of F2. I really don't know the deep inner mechanics yet but these are my observations:
- In faster.h,
pages_available
is always true. - In LightEpoch, the calculation to determine the next safe epoch to flush becomes disconnected from the pending operations. When this happens, there's always a
PENDING_GC
and sometimes aREST
status operation with an epoch > the safe minimum epoch that can be reclaimed.
To expand on the second point, the calculation for the safe epoch becomes stuck, never advances, and the final pending items, which appear to have their final completed status set to pending garbage collection, remain permanently out of reach of LightEpoch and cannot be completed. I suspect this is why the pages_available
loop happens.
I wish I could submit a PR, but these are deep / central mechanics and I'm not sure how long it will be before I can solve this myself.