-
Notifications
You must be signed in to change notification settings - Fork 79
Remove guard against internal samples when decoding genotypes #3301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…d checking when they are isolated
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3301 +/- ##
==========================================
+ Coverage 89.79% 89.80% +0.01%
==========================================
Files 29 29
Lines 31008 31042 +34
Branches 5673 5681 +8
==========================================
+ Hits 27843 27877 +34
Misses 1778 1778
Partials 1387 1387
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really getting this, so I think we need to move up to Python first.
c/tskit/genotypes.c
Outdated
| } | ||
| self->sample_is_present | ||
| = tsk_malloc(num_samples_alloc * sizeof(*self->sample_is_present)); | ||
| if (self->sample_is_present == NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usual practise is to bundle multiple mem error checks together
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
c/tskit/genotypes.c
Outdated
| tsk_memset(present, 0, self->num_samples * sizeof(*present)); | ||
|
|
||
| for (root = left_child[N]; root != TSK_NULL; root = right_sib[root]) { | ||
| stack[++stack_top] = root; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use return value of increment operators, separate into two statements. Please use the existing patterns for these kinds of traversals so I don't have to think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
|
Before I go ahead with the Python, I'll try explaining (100% human text!)? The reason that internal samples are problematic is that, unlike true leaf sample nodes, they are sometimes absent from a tree. This means we don't visit them, and they get the ancestral state rather than the missingness they should have had. To fix this, when we have |
|
That makes sense. What's not clear to me is whether this traversal is done for each site independently, or it's something we're amortising across the sites in a tree. So, for each tree, we could build the list of missing samples at the cost of one traversal (which would be fine), right? |
b6b2bf0 to
b5160d5
Compare
b5160d5 to
dc9e33b
Compare
|
Yeah, was trying to keep it clean, but I thought about it and in dc9e33b we keep the missing samples and only update if the tree has changed. |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, looks good. See comment about needing the traversal?
| present[j] | ||
| = present[j] | ||
| && !((flags[u] & TSK_NODE_IS_SAMPLE) != 0 | ||
| && self->tree.parent[u] == TSK_NULL && left_child[u] == TSK_NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get a restrict pointer for self->parent also
| if (!impute_missing) { | ||
| if (self->alt_samples != NULL | ||
| && self->missingmess_cache_tree_index != self->tree.index) { | ||
| tsk_variant_update_missing_cache(self); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
|
|
||
| tsk_memset(present, 0, self->num_samples * sizeof(*present)); | ||
|
|
||
| stack_top = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a bit dumb, but why do we need to do the traversal? Can't we just do the num_samples loop below, check each sample to see if it's isolated?
|
So might be easier to take this to the final stages by making some Python test cases that'll clearly pick out the different corner cases for discussion. |
Agreed, I've thoroughly confused myself and am cracking out the ascii art trees for salvation. |
|
I think I've finally understood this now. I was getting hung up on roots and how they fit into the missingness classification but a root can never be a non-sample node without parents or children (i.e. isolated) as roots have to be reachable up the tree from samples. I think this means we don't need the traversal at all. I'm taking time to nail the test cases first, but wanted to check that my understanding is correct! |
|
I think you're right, yeah. |
|
Replaced by #3313 |
@jeromekelleher This seems to be the most straightforward way of supporting internal nodes with missingness - if we have a specific set of nodes we traverse to check which are internal and missing.