Using ML to predict whether drugs can cross the blood-brain barrier (BBB), with molecular descriptors extracted from chemical structures via the RDKit toolkit.

orvelte/BBB_permeability_prediction

Blood-Brain Barrier (BBB) Permeability Prediction

This project uses RDKit to extract molecular descriptors from drug SMILES strings and applies machine learning to predict blood-brain barrier (BBB) permeability.

Features

  • Molecular Analysis: Extract 200+ molecular descriptors from SMILES notation
  • Similarity Analysis: Compare molecular similarity using Morgan fingerprints and Tanimoto coefficients
  • Exploratory Data Analysis: Comprehensive EDA with PCA visualization
  • Machine Learning Models: Random Forest, SVM, and Logistic Regression classifiers for BBB prediction
  • Model Evaluation: Cross-validation, confusion matrices, ROC curves, and feature importance analysis
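
The similarity feature rests on the Tanimoto coefficient: for two fingerprints, the number of bits set in both divided by the number of bits set in either. The sketch below shows only that arithmetic in plain Python, assuming each fingerprint is given as a set of on-bit indices. The function name `tanimoto` and the bit sets are hypothetical and are not real Morgan fingerprints; the repository itself may well use RDKit's built-in similarity functions instead.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints: return 0.0 by convention (an assumption of this sketch)
        return 0.0
    return len(a & b) / len(union)


# Made-up on-bit indices, for illustration only
fp_1 = {3, 17, 42, 128, 501}
fp_2 = {3, 17, 99, 128, 777}

similarity = tanimoto(fp_1, fp_2)  # 3 shared bits out of 7 distinct bits
```

A value of 1.0 means the two bit sets are identical and 0.0 means they share no bits.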

Project Structure

BBB_permeability_prediction/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── .gitignore               # Git ignore file
├── LICENSE                  # MIT License
├── data/                    # Data directory
│   ├── sample_data.csv      # Sample dataset
│   └── README.md           # Data documentation
├── src/                     # Source code
│   ├── bbb.py              # Main analysis script
│   ├── ml_models.py        # Machine learning models
│   └── predict_bbb.py      # Prediction script
├── notebooks/               # Jupyter notebooks
│   └── bbb_analysis.ipynb  # Interactive analysis
├── docs/                    # Documentation
│   └── molecular_descriptors.md
├── results/                 # Output files
│   ├── plots/              # Generated plots
│   └── models/             # Trained models
└── tests/                  # Unit tests
    └── test_bbb.py

Installation

Prerequisites

  • Python 3.7 or higher
  • pip package manager

Setup

  1. Clone the repository:

    git clone <repository-url>
    cd BBB_permeability_prediction
  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

Usage

Basic Usage

  1. Place your BBB dataset at data/BBB_datasets.csv with columns:

    • SMILES: Chemical structure in SMILES notation
    • Class: BBB permeability class (BBB+ or BBB-)
  2. Run the analysis:

    python src/bbb.py
  3. Predict BBB permeability for new molecules:

    python src/predict_bbb.py "CC(=O)NC1=CC=C(C=C1)O" "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

Or use the Makefile target:

make predict SMILES="CC(=O)NC1=CC=C(C=C1)O"

Jupyter Notebook

For interactive analysis, use the Jupyter notebook:

jupyter notebook notebooks/bbb_analysis.ipynb

Dataset Format

The script expects a CSV file with the following columns:

| Column | Description | Example |
|--------|-------------|---------|
| SMILES | Chemical structure in SMILES notation | CC(=O)NC1=CC=C(C=C1)O |
| Class | BBB permeability class | BBB+ or BBB- |
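
Before running the analysis it can help to check that a file matches this format. The following is a minimal sketch using only the standard library; `load_bbb_rows` is a hypothetical helper, not part of the repository, and the in-memory sample uses placeholder labels rather than measured permeability data.

```python
import csv
import io

REQUIRED_COLUMNS = ("SMILES", "Class")
VALID_CLASSES = {"BBB+", "BBB-"}


def load_bbb_rows(handle):
    """Read rows from an open CSV file, checking the expected columns and labels."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")

    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        smiles = (row["SMILES"] or "").strip()
        label = (row["Class"] or "").strip()
        if not smiles:
            raise ValueError(f"line {line_no}: empty SMILES")
        if label not in VALID_CLASSES:
            raise ValueError(f"line {line_no}: unexpected Class {label!r}")
        rows.append({"SMILES": smiles, "Class": label})
    return rows


# In-memory stand-in for data/BBB_datasets.csv; the labels are placeholders
sample = io.StringIO("SMILES,Class\nCCO,BBB+\nCC(=O)O,BBB-\n")
rows = load_bbb_rows(sample)
```

For a real file, pass an open handle instead, for example `open("data/BBB_datasets.csv", newline="")`.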

Molecular Descriptors

The script extracts 200+ molecular descriptors including:

  • Molecular weight
  • LogP (lipophilicity)
  • Number of rotatable bonds
  • Hydrogen bond donors/acceptors
  • Topological descriptors
  • And many more...
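
As a rough illustration of how such descriptors can be computed, the sketch below parses a SMILES string with RDKit and evaluates every entry in `Descriptors.descList`. The helper name `compute_descriptors` is an assumption for this example, and src/bbb.py may organize the extraction differently. The exact number of descriptors depends on the installed RDKit version.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def compute_descriptors(smiles):
    """Return {descriptor_name: value} for one SMILES string, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None (and logs an error) for SMILES it cannot parse
        return None
    # Descriptors.descList is RDKit's list of (name, function) pairs
    return {name: fn(mol) for name, fn in Descriptors.descList}


# Paracetamol, taken from the example molecules in this README
paracetamol = compute_descriptors("CC(=O)NC1=CC=C(C=C1)O")

# A few of the descriptors named in the list above
selected = {
    name: paracetamol[name]
    for name in ("MolWt", "MolLogP", "NumRotatableBonds", "NumHDonors", "NumHAcceptors")
}
```

Returning `None` for unparseable input lets the caller decide whether to skip the row or stop with an error.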

Output

The script generates:

  • Molecular similarity analysis
  • PCA visualization of molecular descriptors
  • Statistical summaries of the dataset
  • Plots showing drug clustering by BBB permeability
  • Machine learning model training and evaluation
  • Feature importance analysis
  • Model performance comparison
  • Confusion matrices and ROC curves
  • Trained models saved for future predictions
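
The PCA step listed above can be sketched as follows, assuming the descriptors are standardized before projection (the README does not say whether the script scales them). The descriptor matrix here is random placeholder data standing in for real RDKit output.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder matrix: 50 molecules x 10 descriptors of random values
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(50, 10))

# Standardize each descriptor so large-valued ones do not dominate the components
scaled = StandardScaler().fit_transform(descriptors)

# Project onto the first two principal components
pca = PCA(n_components=2)
coords = pca.fit_transform(scaled)

# Fraction of total variance captured by each of the two components
explained = pca.explained_variance_ratio_
```

Plotting `coords[:, 0]` against `coords[:, 1]`, colored by the Class column, would give a clustering view of the kind described in the list.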

Example Molecules Analyzed

  • Paracetamol: CC(=O)NC1=CC=C(C=C1)O
  • Caffeine: CN1C=NC2=C1C(=O)N(C(=O)N2C)C
  • Theophylline: CN1C2=C(C(=O)N(C1=O)C)NC=N2
  • MDMA: CC(CC1=CC2=C(C=C1)OCO2)NC

Dependencies

  • RDKit: Cheminformatics toolkit
  • pandas: Data manipulation
  • numpy: Numerical computing
  • matplotlib: Plotting
  • seaborn: Statistical visualization
  • scikit-learn: Machine learning

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • RDKit community for the excellent cheminformatics toolkit
  • Original Colab notebook: BBB Analysis

Machine Learning Models

The project implements three machine learning algorithms:

1. Random Forest Classifier

  • Advantages: Handles non-linear relationships, provides feature importance
  • Parameters: 100 estimators, max depth 10, min samples split 5
  • Use Case: Best for interpretable predictions with feature importance

2. Support Vector Machine (SVM)

  • Advantages: Effective for high-dimensional data, good generalization
  • Parameters: RBF kernel, C=1.0, gamma='scale'
  • Use Case: Good performance on molecular descriptor data

3. Logistic Regression

  • Advantages: Fast training, interpretable coefficients
  • Parameters: C=1.0, max iterations 1000
  • Use Case: Baseline model and fast predictions
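
The parameters listed for the three models map onto scikit-learn roughly as shown below. This is a sketch, not the contents of src/ml_models.py: the `build_models` helper, the fixed random seed, the `StandardScaler` step in front of the SVM and Logistic Regression, and the synthetic training data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

RANDOM_STATE = 42  # assumption: fixed seed so the sketch is reproducible


def build_models():
    """Build the three classifiers with the parameters listed in this README."""
    return {
        "random_forest": RandomForestClassifier(
            n_estimators=100, max_depth=10, min_samples_split=5,
            random_state=RANDOM_STATE,
        ),
        # Scaling is an assumption; SVMs are sensitive to feature scale
        "svm": make_pipeline(
            StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")
        ),
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
        ),
    }


# Synthetic stand-in for a descriptor matrix and BBB+/BBB- labels
X, y = make_classification(n_samples=200, n_features=20, random_state=RANDOM_STATE)

models = build_models()
for model in models.values():
    model.fit(X, y)
```

Keeping the models in a dictionary makes it straightforward to loop over them when comparing performance.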

Model Evaluation

  • Cross-validation: 5-fold CV for robust performance estimation
  • Metrics: Accuracy, precision, recall, F1-score, AUC-ROC
  • Feature Importance: Top 20 most important molecular descriptors
  • Visualization: Confusion matrices, ROC curves, performance comparison
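
One way to obtain these numbers with scikit-learn is sketched below, using `cross_validate` with stratified 5-fold splits and the Random Forest settings from this README. The synthetic data, the stratification and shuffling, and the random seed are assumptions of the example; the project's own evaluation code may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for a descriptor matrix and binary BBB labels
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

model = RandomForestClassifier(
    n_estimators=100, max_depth=10, min_samples_split=5, random_state=0
)

# 5-fold cross-validation, keeping the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
scores = cross_validate(model, X, y, cv=cv, scoring=scoring)

# Mean of each metric across the five folds
summary = {metric: scores[f"test_{metric}"].mean() for metric in scoring}

# Feature importance: indices of the 20 most important features after a full fit
model.fit(X, y)
top_20 = model.feature_importances_.argsort()[::-1][:20]
```

With real data, `top_20` would index into the list of descriptor names to report which descriptors matter most.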

Future Enhancements

Already implemented (see Machine Learning Models and Model Evaluation):

  • Machine learning models (Random Forest, SVM, Logistic Regression)
  • Feature importance analysis
  • Cross-validation and model evaluation

Planned:

  • Hyperparameter tuning with GridSearch
  • Neural Networks (Deep Learning)
  • Web interface for drug prediction
  • API for batch processing
  • Integration with drug databases

Troubleshooting

Common Issues

  1. RDKit installation issues: Try using conda instead of pip:

    conda install -c conda-forge rdkit
  2. Missing dataset: Ensure BBB_datasets.csv is in the data/ directory

  3. Memory issues: For large datasets, consider processing in batches
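
For the memory issue, batch processing can be as simple as a generator that yields fixed-size chunks of SMILES strings, so descriptors are computed for one chunk at a time. The `batched` helper below is a hypothetical sketch and not part of the repository.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# SMILES strings from the example molecules in this README
smiles_list = [
    "CC(=O)NC1=CC=C(C=C1)O",
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "CN1C2=C(C(=O)N(C1=O)C)NC=N2",
]
batches = list(batched(smiles_list, 2))
```

Because the generator consumes its input lazily, it also works with a file iterator, so the full dataset never has to be held in memory at once.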

Getting Help

  • Check the Issues page
  • Create a new issue with detailed error information
  • Include your Python version and operating system

Citation

If you use this project in your research, please cite:

@software{bbb_prediction,
  title={Blood-Brain Barrier Permeability Prediction},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/BBB_permeability_prediction}
}
