This project uses RDKit to extract molecular descriptors from drug SMILES strings and applies machine learning to predict blood-brain barrier (BBB) permeability.
- Molecular Analysis: Extract 200+ molecular descriptors from SMILES notation
- Similarity Analysis: Compare molecular similarity using Morgan fingerprints and Tanimoto coefficients
- Exploratory Data Analysis: Comprehensive EDA with PCA visualization
- Machine Learning Models: Random Forest, SVM, and Logistic Regression classifiers for BBB prediction
- Model Evaluation: Cross-validation, confusion matrices, ROC curves, and feature importance analysis
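The similarity analysis listed above relies on the Tanimoto coefficient, which for two fingerprints is the number of shared on-bits divided by the number of on-bits in either. The sketch below is a minimal pure-Python illustration and assumes each fingerprint is given as a set of on-bit indices; the function name `tanimoto` and the example bit sets are hypothetical, and in the project itself the fingerprints would come from RDKit's Morgan fingerprint generator rather than being written by hand.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / union


# Hypothetical on-bit indices standing in for two Morgan fingerprints.
fp_x = {1, 5, 9, 12}
fp_y = {1, 5, 7, 12, 20}

# 3 shared bits out of 6 distinct bits overall.
print(tanimoto(fp_x, fp_y))  # 0.5
```

A value of 1.0 means the two fingerprints set exactly the same bits, and 0.0 means they share none.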
BBB_permeability_prediction/
├── README.md # This file
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore file
├── LICENSE # MIT License
├── data/ # Data directory
│ ├── sample_data.csv # Sample dataset
│ └── README.md # Data documentation
├── src/ # Source code
│ ├── bbb.py # Main analysis script
│ ├── ml_models.py # Machine learning models
│ └── predict_bbb.py # Prediction script
├── notebooks/ # Jupyter notebooks
│ └── bbb_analysis.ipynb # Interactive analysis
├── docs/ # Documentation
│ └── molecular_descriptors.md
├── results/ # Output files
│ ├── plots/ # Generated plots
│ └── models/ # Trained models
└── tests/ # Unit tests
    └── test_bbb.py
- Python 3.7 or higher
- pip package manager
- Clone the repository:
git clone <repository-url>
cd BBB_permeability_prediction
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Place your BBB dataset at data/BBB_datasets.csv with the following columns:
  - SMILES: Chemical structure in SMILES notation
  - Class: BBB permeability class (BBB+ or BBB-)
- Run the analysis:
python src/bbb.py
- Predict BBB permeability for new molecules:
python src/predict_bbb.py "CC(=O)NC1=CC=C(C=C1)O" "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
Or, using the Makefile:
make predict SMILES="CC(=O)NC1=CC=C(C=C1)O"
For interactive analysis, use the Jupyter notebook:
jupyter notebook notebooks/bbb_analysis.ipynb
The script expects a CSV file with the following columns:
| Column | Description | Example |
|---|---|---|
| SMILES | Chemical structure in SMILES notation | CC(=O)NC1=CC=C(C=C1)O |
| Class | BBB permeability class | BBB+ or BBB- |
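Before running the analysis it can help to check that a file matches this layout. The following is a minimal sketch using only the standard library; the helper name `validate_rows` and the in-memory sample (including its class labels) are made up for illustration and are not part of the project's scripts.

```python
import csv
import io

REQUIRED_COLUMNS = ("SMILES", "Class")
VALID_CLASSES = {"BBB+", "BBB-"}


def validate_rows(handle):
    """Return (valid_rows, problems) for a CSV file-like object."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")

    valid, problems = [], []
    # Data rows start on line 2 because line 1 holds the header.
    for line_no, row in enumerate(reader, start=2):
        smiles = (row["SMILES"] or "").strip()
        label = (row["Class"] or "").strip()
        if not smiles:
            problems.append((line_no, "empty SMILES"))
        elif label not in VALID_CLASSES:
            problems.append((line_no, f"unexpected Class {label!r}"))
        else:
            valid.append({"SMILES": smiles, "Class": label})
    return valid, problems


# Illustrative input only: one well-formed row and two malformed ones.
sample = io.StringIO(
    "SMILES,Class\n"
    "CC(=O)NC1=CC=C(C=C1)O,BBB+\n"
    ",BBB-\n"
    "CCO,maybe\n"
)
rows, issues = validate_rows(sample)
print(len(rows), "valid row(s)")
for line_no, message in issues:
    print(f"line {line_no}: {message}")
```

This check only covers the column names and class labels; whether each SMILES string is chemically parseable is a separate question that RDKit answers when the molecule is loaded.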
The script extracts 200+ molecular descriptors including:
- Molecular weight
- LogP (lipophilicity)
- Number of rotatable bonds
- Hydrogen bond donors/acceptors
- Topological descriptors
- And many more...
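As a rough illustration of how such descriptors can be computed, the sketch below walks RDKit's built-in `Descriptors.descList` registry for a single molecule. It is an assumption about the general approach, not a copy of what src/bbb.py does; the helper name `describe` is hypothetical, and the exact number of descriptors returned depends on the installed RDKit version.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def describe(smiles):
    """Return {descriptor_name: value} for a SMILES string, or None if it cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit signals an unparseable SMILES by returning None.
        return None
    # descList is a list of (name, function) pairs covering RDKit's 2D descriptors.
    return {name: fn(mol) for name, fn in Descriptors.descList}


desc = describe("CC(=O)NC1=CC=C(C=C1)O")  # paracetamol
print(len(desc), "descriptors computed")
for key in ("MolWt", "MolLogP", "NumRotatableBonds", "NumHDonors", "NumHAcceptors"):
    print(key, desc[key])
```

Returning `None` for invalid input lets the caller decide whether to skip the row or report it, rather than failing partway through a dataset.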
The script generates:
- Molecular similarity analysis
- PCA visualization of molecular descriptors
- Statistical summaries of the dataset
- Plots showing drug clustering by BBB permeability
- Machine learning model training and evaluation
- Feature importance analysis
- Model performance comparison
- Confusion matrices and ROC curves
- Trained models saved for future predictions
- Paracetamol:
CC(=O)NC1=CC=C(C=C1)O
- Caffeine:
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
- Theophylline:
CN1C2=C(C(=O)N(C1=O)C)NC=N2
- MDMA:
CC(CC1=CC2=C(C=C1)OCO2)NC
- RDKit: Cheminformatics toolkit
- pandas: Data manipulation
- numpy: Numerical computing
- matplotlib: Plotting
- seaborn: Statistical visualization
- scikit-learn: Machine learning
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- RDKit community for the excellent cheminformatics toolkit
- Original Colab notebook: BBB Analysis
The project implements three machine learning algorithms:
- Random Forest
  - Advantages: Handles non-linear relationships, provides feature importance
  - Parameters: 100 estimators, max depth 10, min samples split 5
  - Use Case: Best for interpretable predictions with feature importance
- Support Vector Machine (SVM)
  - Advantages: Effective for high-dimensional data, good generalization
  - Parameters: RBF kernel, C=1.0, gamma='scale'
  - Use Case: Good performance on molecular descriptor data
- Logistic Regression
  - Advantages: Fast training, interpretable coefficients
  - Parameters: C=1.0, max iterations 1000
  - Use Case: Baseline model and fast predictions
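The parameters listed above map directly onto scikit-learn estimators for Random Forest, SVM, and Logistic Regression. The sketch below shows one way to set them up; it is not taken from src/ml_models.py. The `build_models` helper, the fixed random seed, the `StandardScaler` step in front of the SVM and Logistic Regression, and the synthetic data from `make_classification` are all assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_models(seed=0):
    """Build the three classifiers with the hyperparameters listed above."""
    return {
        "Random Forest": RandomForestClassifier(
            n_estimators=100, max_depth=10, min_samples_split=5, random_state=seed
        ),
        # Scaling is an assumption of this sketch: SVMs and logistic
        # regression are sensitive to feature scale, tree ensembles are not.
        "SVM": make_pipeline(
            StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")
        ),
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
        ),
    }


# Synthetic stand-in for a descriptor matrix and BBB+/BBB- labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

for name, model in build_models().items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

The scores printed here describe the synthetic data only and say nothing about performance on a real BBB dataset.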
- Cross-validation: 5-fold CV for robust performance estimation
- Metrics: Accuracy, precision, recall, F1-score, AUC-ROC
- Feature Importance: Top 20 most important molecular descriptors
- Visualization: Confusion matrices, ROC curves, performance comparison
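To make the threshold-based metrics above concrete, the following sketch derives accuracy, precision, recall, and F1-score from the four confusion-matrix counts. The function name `classification_metrics` and the example labels are hypothetical, BBB+ is treated as the positive class by assumption, and AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive="BBB+"):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Report 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Made-up labels for illustration: 3 TP, 1 FN, 1 FP, 3 TN.
y_true = ["BBB+", "BBB+", "BBB+", "BBB-", "BBB-", "BBB-", "BBB-", "BBB+"]
y_pred = ["BBB+", "BBB+", "BBB-", "BBB-", "BBB-", "BBB+", "BBB-", "BBB+"]

result = classification_metrics(y_true, y_pred)
print(result["confusion"])
print(result["accuracy"], result["precision"], result["recall"], result["f1"])
```

With these labels every metric comes out to 0.75, since the model makes one false positive and one false negative across eight molecules.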
- Implement machine learning models (Random Forest, SVM, Logistic Regression)
- Add feature importance analysis
- Cross-validation and model evaluation
- Hyperparameter tuning with GridSearch
- Neural Networks (Deep Learning)
- Web interface for drug prediction
- API for batch processing
- Integration with drug databases
- RDKit installation issues: Try using conda instead of pip:
conda install -c conda-forge rdkit
- Missing dataset: Ensure BBB_datasets.csv is in the data/ directory
- Memory issues: For large datasets, consider processing in batches
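One way to process a large dataset in batches is to consume the input lazily in fixed-size chunks, so that only one chunk of molecules is held in memory at a time. The sketch below uses only the standard library; the `batched` helper and the batch size are hypothetical, and the loop body is where per-batch work such as descriptor extraction would go.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of up to `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


smiles_list = [
    "CC(=O)NC1=CC=C(C=C1)O",         # paracetamol
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # caffeine
    "CN1C2=C(C(=O)N(C1=O)C)NC=N2",   # theophylline
    "CC(CC1=CC2=C(C=C1)OCO2)NC",     # MDMA
]

for batch_no, batch in enumerate(batched(smiles_list, 3), start=1):
    # Placeholder for per-batch work such as computing descriptors.
    print(f"batch {batch_no}: {len(batch)} molecule(s)")
```

Because `batched` accepts any iterable, it can also wrap a `csv.DictReader` so that rows are read from disk only as each batch is requested.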
- Check the Issues page
- Create a new issue with detailed error information
- Include your Python version and operating system
If you use this project in your research, please cite:
@software{bbb_prediction,
title={Blood-Brain Barrier Permeability Prediction},
author={Your Name},
year={2024},
url={https://github.com/yourusername/BBB_permeability_prediction}
}