Skip to content

Conversation

khalilouardini
Copy link

@khalilouardini khalilouardini commented Oct 17, 2023

This PR has several purposes:

  • it leverages formulaic to allow using interaction terms of the form a:b:...:zfor design_factors and formulas based on a combination of single design factors and of such interaction terms a + b + a:b:...:z (no support for other more complex structures or alternative syntax enabled by formulaic's grammar yet such as i.e. ~ C(X, contr.treatment("x")), a * b, contr.poly, ...)
  • allows the user to pass a design_matrixto a dds dataset
  • fixes pyproject.toml syntax (linters complained)

Completion milestones:

  • adds @anaischossegros test on edge-case from deseq2 vignette
  • test to check that passing a design_matrix populates all necessary attributes such as design_factors (kind of as end2end integration tests are passing)
  • continuous factors well-handled
  • deseq2 examples matched (at least at the level of the design matrix)
  • arbitrary number of interactions well handled
  • deseq2 examples matched end2end

@jeandut jeandut marked this pull request as ready for review October 19, 2023 17:13
@jeandut
Copy link
Contributor

jeandut commented Oct 19, 2023

Youpi tests pass ! @khalilouardini and @BorisMuzellec you are on !

@BorisMuzellec
Copy link
Collaborator

BorisMuzellec commented Oct 20, 2023

Hi @jeandut and @khalilouardini, thanks for this PR!

It's nice that you went all the way to even support recursive interactions like "a: b:c".

I experimented a bit with your code, and there seem to be a few remaining issues though.
Running

from pydeseq2.utils import build_design_matrix, load_example_data

counts = load_example_data()
metadata = load_example_data(modality = "metadata")

design = build_design_matrix(metadata=clinical, 
                   design_factors=["condition", "condition:group"])

I get the following design:

          intercept  condition_B_vs_A  condition:group_AY_vs_A_vs_AX  \
sample1            1                 0                              0   
sample2            1                 0                              1   
sample3            1                 0                              0   
sample4            1                 0                              1   
sample5            1                 0                              0   
...              ...               ...                            ...   
sample96           1                 1                              0   
sample97           1                 1                              0   
sample98           1                 1                              0   
sample99           1                 1                              0   
sample100          1                 1                              0   

           condition:group_BX_vs_A_vs_AX  condition:group_BY_vs_A_vs_AX  
sample1                                0                              0  
sample2                                0                              0  
sample3                                0                              0  
sample4                                0                              0  
sample5                                0                              0  
...                                  ...                            ...  
sample96                               0                              1  
sample97                               1                              0  
sample98                               0                              1  
sample99                               1                              0  
sample100                              0                              1  

[100 rows x 5 columns]

The columns seem to contain the intended values, but:

  1. There's an issue with variable names. We would expect something like condition:group_AY_vs_AX instead of condition:group_AY_vs_A_vs_AX.

  2. The design is not full rank, e.g. ((design.iloc[:,-1] + design.iloc[:,-2] - design.iloc[:,1])**2).sum() returns 0. This means that one of the last two columns is redundant. Not sure how to determine this automatically though, perhaps looking at the way DESeq2 handles this could help.

Let me know if you need help with this :)

EDIT: I think that in the example above issue 2 is due to the fact the design is of the form "~a + a:b", which creates a redundancy that probably wouldn't be there if it was just "~a:b".

Copy link
Collaborator

@BorisMuzellec BorisMuzellec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above

@jeandut
Copy link
Contributor

jeandut commented Oct 20, 2023

@BorisMuzellec see the modifications I made. LGTM but it's hard to be sure wo test data available.
In order for us to converge faster and minimize iterations of reviews if the PR is still not doing exactly what you want I would appreciate if you could give me couples of:
(design_factors, metadata, continuous_factors) -> expected_design_matrix
For a sufficiently representative set of design_factors and continuous_factors.

@jeandut
Copy link
Contributor

jeandut commented Oct 20, 2023

As a start is this matching DeSeq2 ? and if not what should be changed ?
Capture d’écran 2023-10-20 à 18 07 11

@jeandut
Copy link
Contributor

jeandut commented Oct 20, 2023

And introducing continuous factors, same question:
Capture d’écran 2023-10-20 à 18 15 01

@BorisMuzellec
Copy link
Collaborator

BorisMuzellec commented Oct 23, 2023

Thanks @jeandut for the code updates. The variable names look fine now :).

Matrix rank

We still have the rank issue with designs of the form `"~ factor1 + factor1:factor2" though. E.g., in the same example as above,

from pydeseq2.utils import build_design_matrix
counts = load_example_data()
metadata = load_example_data(modality = "metadata")
design = build_design_matrix(metadata=metadata, 
                   design_factors=["condition", "condition:group"])

we have the following design:

    intercept  condition_B_vs_A  condition:group_AY_vs_AX  \
sample1            1                 0                         0   
sample2            1                 0                         1   
sample3            1                 0                         0   
sample4            1                 0                         1   
sample5            1                 0                         0   
...              ...               ...                       ...   
sample96           1                 1                         0   
sample97           1                 1                         0   
sample98           1                 1                         0   
sample99           1                 1                         0   
sample100          1                 1                         0   

           condition:group_BX_vs_AX  condition:group_BY_vs_AX  
sample1                           0                         0  
sample2                           0                         0  
sample3                           0                         0  
sample4                           0                         0  
sample5                           0                         0  
...                             ...                       ...  
sample96                          0                         1  
sample97                          1                         0  
sample98                          0                         1  
sample99                          1                         0  
sample100                         0                         1  

which does not have full column rank, because condition_B_vs_A = condition:group_BX_vs_AX condition:group_BY_vs_AX (since group has only two values X and Y, knowing BX and BY is enough to know B).

In comparison, the design matrix output by DESeq2 only has the following columns ["intercept", "condition_B_vs_A", "condition:group_AY_vs_AX", "condition:group_BY_vs_AX"].

I think that when there are interaction terms in the design, we need to check whether those variables are also present on their own, and if so remove an additional column.
I added a test to check that the design has full rank in this case, which is why the CI now fails.

In-place modification

On a side note, the present code modifies the metadata that is being passed (it adds colums). E.G:

from pydeseq2.utils import build_design_matrix
counts = load_example_data()
metadata = load_example_data(modality = "metadata")
print(metadata)

 condition group
sample1           A     X
sample2           A     Y
sample3           A     X
sample4           A     Y
sample5           A     X
...             ...   ...
sample96          B     Y
sample97          B     X
sample98          B     Y
sample99          B     X
sample100         B     Y

_ = build_design_matrix(metadata=metadata, 
                   design_factors=["condition", "condition:group"])
print(metadata)

      condition group condition:group
sample1           A     X              AX
sample2           A     Y              AY
sample3           A     X              AX
sample4           A     Y              AY
sample5           A     X              AX
...             ...   ...             ...
sample96          B     Y              BY
sample97          B     X              BX
sample98          B     Y              BY
sample99          B     X              BX
sample100         B     Y              BY

It would be better to avoid this, e.g. by adding an inplace argument to the interaction term utilities, or deleting added columns after the code is done running.

I'm not a huge fan of adding to many dependencies, but I'm starting to wonder if we could save us some pain by relying on formulaic, as suggested in #125...

@thondeboer
Copy link

Has this attempt to introduce interaction terms been abandoned? I was hoping to not have to resort to R to get interaction designs to work, since that is a very important part of DESeq2 in R and was quite surprised this was not part of the original pyDESeq2...Is this specifically hard to implement for some reason in Python, just curious...

@jeandut
Copy link
Contributor

jeandut commented Jan 5, 2024

Has this attempt to introduce interaction terms been abandoned? I was hoping to not have to resort to R to get interaction designs to work, since that is a very important part of DESeq2 in R and was quite surprised this was not part of the original pyDESeq2...Is this specifically hard to implement for some reason in Python, just curious...

  1. This attempt has not been abandoned. Currently it has been because of lack of bandwidth that I could not make more progress, I don't want to make hard commitments but I hope this gets done in Q1.
  2. However indeed this turned out to be more complicated than I expected mainly because of the versatility of the formula and its interaction with the rank-reduction step. For this reason I will rely on formulaic
  3. In the meantime all those manipulations can be done in Python on a case by case basis outside of pydeseq2

@BorisMuzellec BorisMuzellec marked this pull request as draft October 25, 2024 09:55
BorisMuzellec
BorisMuzellec previously approved these changes Oct 25, 2024
@BorisMuzellec BorisMuzellec marked this pull request as ready for review October 28, 2024 10:48
Copy link
Collaborator

@umarteauowkin umarteauowkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi all, thanks for this huge PR
One change requested: there are some redundancies in the utils that I pointed out + one weird behaviour for me where extrated_single_factors -> single_factors seem to have different behaviours. I think we could simplify and add precisions to what these quantities actually are. Could you please check this ?
Apart from that, I think we could add comments to help read the code (and modify it if needed)

itertools.chain(*[factor.split(":") for factor in design_factors])
)

single_design_factors = extracted_single_factors
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compared to the next case (design_factor is a str), shouldn t we take care of the non unique stuff here as well ?

design_factors_list : list
List of design factors. May contain interaction terms.

single_design_factors : list
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you specify if they are unique / in what order they are ?

Comment on lines +657 to +667
potential_values = [
"".join(interacting_values)
for interacting_values in list(
itertools.product(
*[
self.dds.obs[col].cat.categories.tolist()
for col in categorical_columns_interacting
]
)
)
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment on what this does

@jeandut
Copy link
Contributor

jeandut commented Oct 29, 2024

Closing because of #328

@jeandut jeandut closed this Oct 29, 2024
@abearab
Copy link

abearab commented Jan 7, 2025

@jeandut hello and happy new year! is this getting back in radar?

@BorisMuzellec
Copy link
Collaborator

@jeandut hello and happy new year! is this getting back in radar?

@abearab Interaction terms were implemented in #328 and are now available as part of the 0.5.0 pre-release

(Run pip install --pre --upgrade pydeseq2 to get it)

@abearab
Copy link

abearab commented Jan 7, 2025

@jeandut hello and happy new year! is this getting back in radar?

@abearab Interaction terms were implemented in #328 and are now available as part of the 0.5.0 pre-release

(Run pip install --pre --upgrade pydeseq2 to get it)

Awesome, sorry I miss interpreted that. Thanks for the clarification

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants