Zama.ai Bounty for Season 8: Implement an FHE-based Biological Age and Aging Pace Estimation ML Model Using Zama Libraries
- CpG group := (Cytosine - phosphate - Guanine)
- Notable for role in gene regulation through methylation processes
- Just polynomials of degree 1, very efficient representation in FHE applications
- Manhattan plot
- HumanMethylation450 BeadChip (released in 2011)
- Targets over 450k sites across the human genome
- Models trained to predict chronological age of tissue based on biomarkers
- Delta between predicted (biological) age and chronological age used as a marker to predict
- Mortality risk
- Disease states
- etc.
- Inputs (CpGs)
- Outputs (Predicted chronological age)
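As a rough sketch of what a linear clock computes, assuming $n$ CpG sites with beta values $\beta_i$, trained weights $w_i$ and an intercept $w_0$ (symbols introduced here for illustration, not taken from any specific clock):

```latex
\hat{a} = w_0 + \sum_{i=1}^{n} w_i \, \beta_i
\qquad
\Delta = \hat{a} - a_{\text{chron}}
```

Here $\hat{a}$ is the predicted age, $a_{\text{chron}}$ the chronological age, and $\Delta$ the gap used as a marker. The expression is a degree-1 polynomial in the inputs, which is why it maps well onto FHE. Some published clocks apply an extra transformation to the weighted sum; that step is left out of this sketch.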
Clock | Model / description |
---|---|
Horvath | ElasticNet |
AltumAge | Deep Learning-based |
PCGrimAge | PCA-based version of GrimAge |
GrimAge2 | Latest version of GrimAge ? |
DunedinPACE | Biomarker of the pace of aging |
- Start with simpler models (linear regression-based clocks) as easier to implement in FHE
- Balance accuracy vs compute complexity - some models might use hundreds of CpG sites that would be expensive in FHE
- Horvath clock well established but uses elastic net regression with many features
- DunedinPACE measures aging pace rather than biological age, which might be interesting but more complex
- Linear models most straightforward
- Avoid non-linear activation functions, if possible
- Consider feature count - Fewer CpG sites means faster FHE computation
- Some biological clocks use relatively few CpG sites (~10-50) which would be ideal (NOTE: Need to validate this)
- Start with Concrete ML for higher-level abstractions
- Need to quantise model (using `brevitas-nn` / `concrete-ml`, etc.)
- Benchmark accuracy between original + FHE - expect precision loss (though this depends on implementation / model, etc.)
- Reduce multiplicative depth
- Use concrete's compiler to analyse circuit depth / bottlenecks
- Precision vs performance trade-off (this one is likely key)
- Heavily consider preprocessing strategies pre-encryption to offload computation
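To get a feel for the precision vs performance trade-off before touching any FHE library, the quantisation step can be simulated in plain Python. This is a minimal sketch only: the beta values, weights, intercept and the uniform quantisation scheme are all made up for illustration, and this is not how Concrete ML quantises internally.

```python
# Hypothetical toy data: 4 CpG beta values in [0, 1] and float clock weights.
betas = [0.12, 0.87, 0.45, 0.63]
weights = [1.5, -2.25, 0.75, 3.0]
intercept = 40.0


def quantise(values, n_bits, lo, hi):
    """Map floats in [lo, hi] to unsigned integers in [0, 2**n_bits - 1]."""
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels
    return [round((v - lo) / scale) for v in values], scale


def quantised_predict(betas, weights, intercept, n_bits):
    """Integer-only sums (the part that would run under FHE), then dequantise."""
    w_lo, w_hi = min(weights), max(weights)
    q_b, s_b = quantise(betas, n_bits, 0.0, 1.0)
    q_w, s_w = quantise(weights, n_bits, w_lo, w_hi)
    # beta ~ s_b * q_b and weight ~ s_w * q_w + w_lo, so the dot product
    # splits into two integer sums plus float rescaling done in the clear.
    int_dot = sum(qb * qw for qb, qw in zip(q_b, q_w))
    int_sum = sum(q_b)
    return intercept + s_b * s_w * int_dot + s_b * w_lo * int_sum


# Reference prediction in full float precision.
float_pred = intercept + sum(w * b for w, b in zip(weights, betas))

# Absolute error of the quantised prediction for several bit widths.
errors = {
    n_bits: abs(quantised_predict(betas, weights, intercept, n_bits) - float_pred)
    for n_bits in (2, 4, 8, 16)
}

for n_bits, err in errors.items():
    print(f"{n_bits:>2} bits: abs error = {err:.6f}")
```

Running the same comparison on real clock coefficients and held-out samples would show how few bits a given clock tolerates before its error becomes unacceptable.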
- Client: Encrypts methylation data
- Server: Processes encrypted data without decryption
- Client: Receives and decrypts the predicted biological age
- Demo + sample data
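The encrypt / compute / decrypt flow above can be sketched end to end with a toy additively homomorphic scheme. Note the assumptions: this uses textbook Paillier as a stand-in, not TFHE and not the Concrete ML API; the primes are far too small to be secure; and the data, weights and fixed-point scale are hypothetical. It only illustrates who holds which key and where the linear model is evaluated.

```python
import math
import random

# Toy Paillier keypair from two known Mersenne primes. NOT secure, demo only.
P, Q = 2**31 - 1, 2**61 - 1
N, N2 = P * Q, (P * Q) ** 2                          # public key: N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)    # private key
MU = pow(LAM, -1, N)                                 # private key

SCALE = 1000  # fixed-point scale agreed by client and server (assumption)


def encrypt(m):
    """Client side: encrypt an integer using only the public key N."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def decrypt(c):
    """Client side: decrypt and map the result back to a signed integer."""
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m


def server_predict(enc_betas, int_weights, int_intercept):
    """Server side: weighted sum on ciphertexts; the private key is never used."""
    acc = (1 + (int_intercept % N) * N) % N2      # intercept as a trivial ciphertext
    for c, w in zip(enc_betas, int_weights):
        acc = acc * pow(c, w % N, N2) % N2        # adds w * beta under encryption
    return acc


# Client: quantise and encrypt methylation data (hypothetical values).
betas = [0.12, 0.87, 0.45, 0.63]
enc_betas = [encrypt(round(b * SCALE)) for b in betas]

# Server: holds the integer-scaled model, sees only ciphertexts.
int_weights = [round(w * SCALE) for w in (1.5, -2.25, 0.75, 3.0)]
int_intercept = round(40.0 * SCALE * SCALE)
enc_result = server_predict(enc_betas, int_weights, int_intercept)

# Client: decrypt and undo the fixed-point scaling.
predicted_age = decrypt(enc_result) / (SCALE * SCALE)
print(predicted_age)
```

A real submission would replace the Paillier functions with the client, server and key-handling tooling from the Zama libraries; the division of roles stays the same.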
Number of Features per Dataset (for `pyaging`)
Challenge Data
- The Illumina HumanMethylation450 BeadChip data
- GEO datasets like GSE40279 (often used for Horvath's clock)
- TCGA (The Cancer Genome Atlas) methylation data
27k_reference: probeAnnotation21kdatMethUsed
CBL_common: coefs
CBL_specific: coefs
Cortex_common: coefs
DunedinPACE: coefs gold_standard_means
HannumG2013: coefs
HorvathS2013: coefs
HorvathS2018: coefs
LevineM2018: coefs
LuA2019: coefs
McEwenL2019: coefs
ShirebyG2020: coefs
YangZ2016: epiTOCcpgs
ZhangQ2019: coefs
ZhangY2017: coefs
subGSE174422: betas info
- `betas`: Methylation beta values, the actual DNA methylation measurements that serve as input features for the model (`X`).
- `coefs`: Coefficient matrices for different biological clock models. Each named entry represents a different published biological age clock with its trained coefficients (weights?).
- `probeAnnotation21kdatMethUsed`: Annotation data for DNA methylation probes (CpG sites) used in the models.
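To make the relationship between `betas` and `coefs` concrete, here is a minimal sketch that applies one clock's coefficients to one sample. Everything named here is an assumption for illustration: the probe IDs, the `"(Intercept)"` key, the dict layout, and the fallback value for missing probes. The real files may store these differently, and some clocks apply an additional transformation to the weighted sum.

```python
def predict_age(betas, coefs, intercept_key="(Intercept)", default_beta=0.5):
    """Weighted sum over the probes a clock uses.

    betas: {probe_id: beta value in [0, 1]} for one sample.
    coefs: {probe_id: weight}, optionally with an intercept entry.
    Probes the clock needs but the sample lacks fall back to default_beta
    (a placeholder imputation, not a recommended strategy).
    """
    age = coefs.get(intercept_key, 0.0)
    missing = []
    for probe, weight in coefs.items():
        if probe == intercept_key:
            continue
        if probe not in betas:
            missing.append(probe)
        age += weight * betas.get(probe, default_beta)
    return age, missing


# Hypothetical clock with three probes and a sample that lacks one of them.
example_coefs = {"(Intercept)": 30.0, "cg_a": 10.0, "cg_b": -4.0, "cg_c": 2.0}
example_betas = {"cg_a": 0.5, "cg_b": 0.25, "cg_z": 0.9}

age, missing = predict_age(example_betas, example_coefs)
print(age, missing)
```

Selecting only the probes listed in `coefs` before encryption is also the natural place to cut the feature count, since probes the clock ignores (like `cg_z` here) never need to be encrypted.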