This project is an AI-powered clinical decision support system that assists healthcare professionals by analyzing patient reports and providing preliminary diagnostic insights. It leverages a Retrieval-Augmented Generation (RAG) architecture to deliver comprehensive and context-aware analysis.
The system takes a patient's medical report (in PDF format) as input, extracts the relevant clinical data, and compares it against a knowledge base of existing medical cases. By identifying similarities and patterns, it generates a detailed report that includes:
- Potential diseases or health concerns
- Recommended precautions and lifestyle changes
- Suggestions for FDA-approved medications and first aid
- Actionable insights for doctors, supported by data from similar cases
- A clear list of the data points that informed the conclusion
This tool is designed to augment the expertise of medical professionals, not to replace it. The AI-generated analysis should always be reviewed and validated by a qualified doctor.
- Multi-Disease Analysis: The knowledge base draws on datasets for several conditions, including diabetes, kidney stones, heart disease, anemia, and thyroid disease.
- RAG Architecture: It uses a Retrieval-Augmented Generation (RAG) model to ground its analysis in real-world data, improving accuracy and relevance.
- Vector Similarity Search: Employs Qdrant and sentence-transformer embeddings to efficiently find similar patient cases from the knowledge base.
- Web-Enhanced Insights: Augments its analysis with information retrieved from the web at query time, which helps keep the recommendations current.
- Comprehensive Reporting: Generates detailed, multi-section reports to support clinical decision-making.
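To make the vector similarity search idea concrete, here is a minimal sketch in plain Python. It is not the project's code: the bag-of-words `embed` function is a stand-in for a sentence-transformer model, the in-memory list is a stand-in for a Qdrant collection, and every name and sample case below is hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a sentence-transformer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, documents, k=3):
    """Return (score, document) pairs for the k best matches, best first."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical case summaries, invented for illustration only.
CASES = [
    "glucose 180 bmi 31 age 54 outcome diabetes",
    "hemoglobin 9.1 mcv 70 outcome anemia",
    "cholesterol 260 chest pain age 61 outcome heart disease",
]
```

Calling `top_k_similar("patient glucose 175 bmi 30", CASES, k=2)` ranks the diabetes case first because it shares the most terms with the query. A real embedding model would also capture semantic similarity between texts that share no exact words, which this toy version cannot do.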
- Data Preprocessing: Medical datasets (CSV files) are cleaned, transformed, and stored as a collection of documents.
- Embedding: The processed documents are converted into vector embeddings using a sentence-transformer model and stored in a Qdrant vector database.
- Patient Report Analysis: When a new patient report (PDF) is provided, the system extracts the clinical text.
- Similarity Search: The extracted text is used to query the Qdrant database, retrieving the most similar patient cases.
- Web Search: An AI-generated query is used to search the web for additional, relevant medical information.
- Report Generation: A large language model (Llama 3.1) synthesizes the information from the patient report, similar cases, and web search results to generate a final, comprehensive analysis.
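The first and last steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the function names, the document template, the prompt layout, and the sample CSV are all hypothetical, and the call to the language model is left out.

```python
import csv
import io


def row_to_document(row, condition):
    """Flatten one CSV row into a single text document, skipping empty fields."""
    fields = ", ".join(
        f"{key}: {value}" for key, value in row.items() if value not in ("", None)
    )
    return f"Condition dataset: {condition}. {fields}"


def csv_to_documents(csv_text, condition):
    """Convert CSV content into a list of text documents, one per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row_to_document(row, condition) for row in reader]


def build_prompt(report_text, similar_cases, web_snippets):
    """Assemble the three context sources into one prompt string."""
    sections = [
        "Patient report:\n" + report_text,
        "Similar cases:\n" + "\n".join(f"- {case}" for case in similar_cases),
        "Web search results:\n" + "\n".join(f"- {s}" for s in web_snippets),
        "Provide a preliminary analysis for review by a qualified doctor.",
    ]
    return "\n\n".join(sections)


# Hypothetical sample data, invented for illustration only.
SAMPLE_CSV = "age,glucose,outcome\n54,180,1\n37,,0\n"
```

Here `csv_to_documents(SAMPLE_CSV, "diabetes")` yields one document per row, and `build_prompt` keeps the patient report, retrieved cases, and web results in clearly separated sections so the model can tell them apart.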
Follow these steps to set up and run the project locally.
- Python 3.8+
- pip for package management
- A Hugging Face API token
- A Qdrant account (for cloud storage) or a local Qdrant instance
- Clone the repository:

      git clone https://github.com/your-username/Healthcare-Assistant.git
      cd Healthcare-Assistant
- Create and activate a virtual environment:

      python -m venv venv
      source venv/bin/activate
- Install the required packages:

      pip install -r requirements.txt
- Set up your environment variables: create a .env file in the root directory and add your API keys:

      HUGGINGFACE_API_TOKEN="your_huggingface_api_token"
      QDRANT_API_KEY="your_qdrant_api_key"
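Before running the scripts, it can help to confirm that both keys are actually set. The sketch below is a hypothetical standalone check using only the standard library. How the project itself reads the .env file is not shown here, so `load_env_file` and `missing_keys` are illustrative names, not part of the codebase.

```python
REQUIRED_KEYS = ("HUGGINGFACE_API_TOKEN", "QDRANT_API_KEY")


def load_env_file(path=".env"):
    """Parse simple KEY=value lines into a dict (does not modify os.environ)."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=value.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

For example, `missing_keys(load_env_file())` returns an empty list when both keys are present and non-empty.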
The data directory is not tracked by Git. You will need to create it and populate it with the necessary raw data files.
- Create the directory structure:

      mkdir -p data/raw
- Add the raw data files: place your raw CSV data files in the data/raw/ directory. The project is configured to use the following files:
  - diabetes_classification.csv
  - kidney_stone_dataset.csv
  - heart_disease.csv
  - anemia.csv
  - thyroidDf.csv

Your data directory should look like this:

    data/
    └── raw/
        ├── diabetes_classification.csv
        ├── kidney_stone_dataset.csv
        ├── heart_disease.csv
        ├── anemia.csv
        └── thyroidDf.csv
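A quick way to confirm the layout is to check for each expected file before preprocessing. This is a minimal sketch using only the standard library. The file names come from the list above, while the helper name `missing_raw_files` is hypothetical.

```python
from pathlib import Path

EXPECTED_FILES = [
    "diabetes_classification.csv",
    "kidney_stone_dataset.csv",
    "heart_disease.csv",
    "anemia.csv",
    "thyroidDf.csv",
]


def missing_raw_files(raw_dir="data/raw"):
    """Return the expected CSV files that are not present in raw_dir."""
    raw_path = Path(raw_dir)
    return [name for name in EXPECTED_FILES if not (raw_path / name).is_file()]
```

Running `missing_raw_files()` from the repository root returns an empty list once all five files are in place.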
The following datasets are used in this project:
- Kidney Stone Prediction
- Heart Disease Prediction
- Malaria Detection
- Anemia Prediction
- Thyroid Disease Prediction
- Preprocess the data: run the preprocessing script to prepare the datasets.

      python src/pre-processing/preprocess.py
- Create the embeddings: generate embeddings from the preprocessed data and store them in your Qdrant database.

      python src/embeddings/create_embeddings.py
- Run the analysis: place your patient report PDF in the data/raw/ directory (e.g., test_report_2.pdf) and run the retrieval script.

      python src/embeddings/retrieve_embeddings.py
The system will output the final analysis to the console.
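If you prefer to run the three steps in one go, a small driver script is one option. The sketch below is hypothetical and not part of the repository. It assumes the script paths listed above, that it is run from the repository root, and that each script exits with a non-zero status on failure.

```python
import subprocess
import sys

# Script paths as given in the steps above, in the order they should run.
PIPELINE_SCRIPTS = [
    "src/pre-processing/preprocess.py",
    "src/embeddings/create_embeddings.py",
    "src/embeddings/retrieve_embeddings.py",
]


def pipeline_commands(python=sys.executable):
    """Build one command (argument list) per pipeline script."""
    return [[python, script] for script in PIPELINE_SCRIPTS]


def run_pipeline(dry_run=False):
    """Run each script in order; stop at the first failure."""
    for command in pipeline_commands():
        print("Running:", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit status.
            subprocess.run(command, check=True)
```

Calling `run_pipeline(dry_run=True)` prints the commands without executing them, which is a safe way to confirm the order first.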