@benjeffery
Member

@jeromekelleher This seems to be the most straightforward way of supporting internal nodes with missingness - if we have a specific set of nodes we traverse to check which are internal and missing.

@codecov

codecov bot commented Oct 22, 2025

Codecov Report

❌ Patch coverage is 95.83333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.80%. Comparing base (f43ab1f) to head (dc9e33b).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| c/tskit/genotypes.c | 95.83% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3301      +/-   ##
==========================================
+ Coverage   89.79%   89.80%   +0.01%     
==========================================
  Files          29       29              
  Lines       31008    31042      +34     
  Branches     5673     5681       +8     
==========================================
+ Hits        27843    27877      +34     
  Misses       1778     1778              
  Partials     1387     1387              
| Flag | Coverage Δ |
| --- | --- |
| c-tests | 86.88% <95.83%> (+0.02%) ⬆️ |
| lwt-tests | 80.38% <ø> (ø) |
| python-c-tests | 86.97% <ø> (ø) |
| python-tests | 98.84% <ø> (+0.01%) ⬆️ |
| python-tests-no-jit | 33.60% <ø> (ø) |
| python-tests-numpy1 | 50.18% <ø> (+0.12%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| c/tskit/core.c | 95.58% <ø> (-0.02%) ⬇️ |
| c/tskit/core.h | 100.00% <ø> (ø) |
| c/tskit/genotypes.c | 85.97% <95.83%> (+1.22%) ⬆️ |

... and 2 files with indirect coverage changes

Member

@jeromekelleher jeromekelleher left a comment

I'm not really getting this, so I think we need to move up to Python first.

    }
    self->sample_is_present
        = tsk_malloc(num_samples_alloc * sizeof(*self->sample_is_present));
    if (self->sample_is_present == NULL) {
Member

The usual practice is to bundle multiple memory error checks together.

Member Author

Fixed
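
For readers following along, the bundling convention mentioned above might look roughly like the sketch below. It uses plain `malloc` and a hypothetical `demo_variant_t` struct (neither the struct nor the function names come from tskit), and `-1` stands in for a real error code such as `TSK_ERR_NO_MEMORY`.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the variant state; not tskit's actual struct. */
typedef struct {
    int *traversal_stack;
    bool *sample_is_present;
} demo_variant_t;

/* Returns 0 on success and -1 on allocation failure (a stand-in for a
 * library error code). */
int
demo_variant_alloc(demo_variant_t *self, size_t num_nodes, size_t num_samples)
{
    int ret = 0;

    self->traversal_stack = malloc(num_nodes * sizeof(*self->traversal_stack));
    self->sample_is_present
        = malloc(num_samples * sizeof(*self->sample_is_present));
    /* One combined check covers every allocation made above. */
    if (self->traversal_stack == NULL || self->sample_is_present == NULL) {
        ret = -1;
        goto out;
    }
out:
    return ret;
}

void
demo_variant_free(demo_variant_t *self)
{
    /* free(NULL) is a no-op, so this is safe after a partial failure. */
    free(self->traversal_stack);
    free(self->sample_is_present);
}
```

Because both pointers are assigned before the check, the caller can always run the free function afterwards, whichever allocation failed.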

    tsk_memset(present, 0, self->num_samples * sizeof(*present));

    for (root = left_child[N]; root != TSK_NULL; root = right_sib[root]) {
        stack[++stack_top] = root;
Member

We don't use the return value of increment operators; separate this into two statements. Please use the existing patterns for these kinds of traversals so I don't have to think about it.

Member Author

Fixed
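
As an illustration of the style requested above, here is a minimal sketch of a stack-based traversal from the roots in which every increment and decrement is its own statement. The array layout (a virtual root stored at index `num_nodes`) mirrors the `left_child[N]` idiom in the diff, but `DEMO_NULL` and `demo_mark_reachable` are hypothetical names, not tskit API.

```c
#define DEMO_NULL (-1)

/* Visit every node reachable from the roots, setting visited[u] = 1.
 * left_child and right_sib have num_nodes + 1 entries; index num_nodes is
 * the virtual root, whose children are the roots of the tree. The caller
 * supplies a stack with room for num_nodes entries. Returns the number of
 * nodes visited. */
int
demo_mark_reachable(int num_nodes, const int *left_child, const int *right_sib,
    int *stack, char *visited)
{
    int stack_top = -1;
    int count = 0;
    int root, u, v;

    for (u = 0; u < num_nodes; u++) {
        visited[u] = 0;
    }
    for (root = left_child[num_nodes]; root != DEMO_NULL; root = right_sib[root]) {
        stack_top++;
        stack[stack_top] = root;
        while (stack_top >= 0) {
            u = stack[stack_top];
            stack_top--;
            visited[u] = 1;
            count++;
            for (v = left_child[u]; v != DEMO_NULL; v = right_sib[v]) {
                stack_top++;
                stack[stack_top] = v;
            }
        }
    }
    return count;
}
```

On a five-node example where node 3 is a root with children 0 and 1, node 2 is an isolated root, and node 4 is not in the tree, the traversal visits four nodes and leaves node 4 unmarked.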

@benjeffery
Member Author

Before I go ahead with the Python, I'll try explaining (100% human text!)?

The reason that internal samples are problematic is that, unlike true leaf sample nodes, they are sometimes absent from a tree. This means we don't visit them, and they get the ancestral state rather than the missingness they should have had. To fix this, when we have alt_samples we do a simple traversal to find the unvisited samples. Once we have them, we can mark them as missing properly. The final logic for a sample being masked as missing is "unvisited or (a sample that has no parent and no children)".
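
The rule described in this comment could be sketched as follows. This is only an illustration of the stated logic, assuming a `visited` array has already been filled in by a traversal from the roots; `demo_mark_present`, `DEMO_NULL` and `DEMO_NODE_IS_SAMPLE` are hypothetical names rather than tskit identifiers.

```c
#include <stdbool.h>

#define DEMO_NULL (-1)
#define DEMO_NODE_IS_SAMPLE 1u

/* For each requested node, present[j] is true unless the node is missing,
 * where missing = unvisited || (sample flag set && no parent && no children). */
void
demo_mark_present(int num_samples, const int *samples, const unsigned *flags,
    const int *parent, const int *left_child, const char *visited, bool *present)
{
    int j, u;
    bool isolated_sample;

    for (j = 0; j < num_samples; j++) {
        u = samples[j];
        isolated_sample = (flags[u] & DEMO_NODE_IS_SAMPLE) != 0
                          && parent[u] == DEMO_NULL
                          && left_child[u] == DEMO_NULL;
        present[j] = visited[u] && !isolated_sample;
    }
}
```

For example, with requested nodes 0 (a leaf under a root), 2 (an isolated sample), 3 (a visited non-sample root) and 4 (never visited), only nodes 0 and 3 come out as present.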

@jeromekelleher
Member

That makes sense. What's not clear to me is whether this traversal is done for each site independently, or it's something we're amortising across the sites in a tree. So, for each tree, we could build the list of missing samples at the cost of one traversal (which would be fine), right?

@benjeffery benjeffery force-pushed the isolated-internal-check branch from b6b2bf0 to b5160d5 Compare October 23, 2025 11:56
@benjeffery benjeffery force-pushed the isolated-internal-check branch from b5160d5 to dc9e33b Compare October 23, 2025 11:57
@benjeffery
Member Author

Yeah, I was trying to keep it clean, but I thought about it, and in dc9e33b we keep the missing samples and only update them if the tree has changed.
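
The amortisation discussed here, one rebuild per tree rather than one per site, can be sketched with a cache keyed on the tree index. The struct and function names below are hypothetical and the rebuild itself is replaced by a counter, so this shows only the caching pattern, not the code in dc9e33b.

```c
/* Hypothetical cache state; field names are illustrative, not tskit's. */
typedef struct {
    int cached_tree_index; /* -1 means nothing has been cached yet */
    int num_rebuilds;      /* counts how often the per-tree step ran */
} demo_missing_cache_t;

void
demo_cache_init(demo_missing_cache_t *cache)
{
    cache->cached_tree_index = -1;
    cache->num_rebuilds = 0;
}

/* Call once per site, passing the index of the tree the site lies in. The
 * per-tree work only runs when the tree index differs from the cached one. */
void
demo_cache_update(demo_missing_cache_t *cache, int tree_index)
{
    if (cache->cached_tree_index != tree_index) {
        /* Placeholder for the per-tree scan that rebuilds the missing set. */
        cache->num_rebuilds++;
        cache->cached_tree_index = tree_index;
    }
}
```

Decoding six sites that fall in trees 0, 0, 0, 1, 1 and 2 would trigger three rebuilds under this scheme instead of six.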

Member

@jeromekelleher jeromekelleher left a comment

OK, looks good. See comment about needing the traversal?

        present[j]
            = present[j]
              && !((flags[u] & TSK_NODE_IS_SAMPLE) != 0
                   && self->tree.parent[u] == TSK_NULL && left_child[u] == TSK_NULL);
Member

Get a restrict pointer for self->tree.parent as well.

    if (!impute_missing) {
        if (self->alt_samples != NULL
            && self->missingmess_cache_tree_index != self->tree.index) {
            tsk_variant_update_missing_cache(self);
Member

Nice


    tsk_memset(present, 0, self->num_samples * sizeof(*present));

    stack_top = -1;
Member

This is probably a bit dumb, but why do we need to do the traversal? Can't we just do the num_samples loop below, check each sample to see if it's isolated?

@jeromekelleher
Member

So might be easier to take this to the final stages by making some Python test cases that'll clearly pick out the different corner cases for discussion.

@benjeffery
Member Author

> So might be easier to take this to the final stages by making some Python test cases

Agreed, I've thoroughly confused myself and am cracking out the ASCII art trees for salvation.

@benjeffery
Member Author

I think I've finally understood this now. I was getting hung up on roots and how they fit into the missingness classification, but a root can never be a non-sample node without parents or children (i.e. isolated), as roots have to be reachable up the tree from samples. I think this means we don't need the traversal at all. I'm taking time to nail the test cases first, but wanted to check that my understanding is correct!

@jeromekelleher
Member

I think you're right, yeah.
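
If the traversal really is unnecessary, the check reduces to a single loop over the requested nodes, as suggested earlier in the review. The sketch below is one possible reading of that idea, under the assumption that "isolated" means no parent and no children in the current tree; it is not taken from the replacement PR, and `demo_mark_missing_isolated` and `DEMO_NULL` are hypothetical names.

```c
#include <stdbool.h>

#define DEMO_NULL (-1)

/* Mark requested node j as missing when it is isolated in the current tree,
 * i.e. it has neither a parent nor any children. No traversal is needed. */
void
demo_mark_missing_isolated(int num_samples, const int *samples,
    const int *parent, const int *left_child, bool *missing)
{
    int j, u;

    for (j = 0; j < num_samples; j++) {
        u = samples[j];
        missing[j] = parent[u] == DEMO_NULL && left_child[u] == DEMO_NULL;
    }
}
```

In a tree where node 3 is a root with children 0 and 1, and nodes 2 and 4 have no edges, requesting nodes 0, 2, 3 and 4 marks 2 and 4 as missing. The cost is one pass over the requested nodes per tree.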

@benjeffery
Member Author

Replaced by #3313
