This project implements a full RAG‑style research assistant:
- `train_pipeline.py` – end‑to‑end training:
  - PDF ingestion
  - SQLite staging
  - T5 summarization & fine‑tuning
  - “Lite” dataset creation
  - LLaMA fine‑tuning
- `rag_pipeline.py` – interactive RAG chatbot:
  - ChromaDB vector store
  - LangChain RetrievalQA
  - Fine‑tuned LLaMA generation
- `test_rag.py` – automated test suite for your RAG chatbot.
All scripts assume your code lives under C:\codes\….
```
pipeline_project/
├── train_pipeline.py
├── rag_pipeline.py
├── test_rag.py
├── migrate_sqlite_to_chromadb.py
├── pdfs.py
├── pdf_pre.py
├── model.py
├── llama_model.py
├── database_handler.py
├── data_pre.py
└── (optional helper scripts)
```
- `train_pipeline.py` – Orchestrates data ingestion, summarization, and model fine‑tuning in one shot.
- `rag_pipeline.py` – Loads ChromaDB + LangChain + LLaMA to serve an interactive chatbot.
- `test_rag.py` – Runs a list of test questions through your RAG chain and logs outputs.
- `migrate_sqlite_to_chromadb.py` – One‑time migration of all SQLite data into ChromaDB.
- `pdfs.py` / `pdf_pre.py` – Download, extract, and clean PDF text.
- `model.py` – T5 summarization & fine‑tuning utilities.
- `llama_model.py` – LLaMA fine‑tuning utilities.
- `database_handler.py` – SQLite schema & CRUD helpers.
- `data_pre.py` – Text‑to‑T5 preprocessing helper.
`train_pipeline.py` runs the full data → model training loop:

- Download PDFs (from your merged CSV)
- Extract & ingest into SQLite
- Summarize with T5 and update DB
- Fine‑tune T5 on (full_text → summary)
- Create lite DB/pickle/CSV
- Fine‑tune LLaMA on (input_text → target_text)

Key functions: `download_pdfs()`, `process_pdfs_into_sqlite()`, `generate_summaries_and_finetune_t5()`, `create_lite_and_finetune_llama()`.
Each step is checkpoint‑aware and resumes from the latest checkpoint.
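A minimal sketch of the resume logic, assuming the fine‑tuning steps use the Hugging Face `Trainer` (the `trainer` object and `output_dir` here are placeholders):

```python
# Resume-from-latest-checkpoint sketch (assumes HF Trainer is used).
from transformers.trainer_utils import get_last_checkpoint

def train_with_resume(trainer, output_dir):
    # get_last_checkpoint scans output_dir for the newest "checkpoint-*"
    # folder and returns its path, or None if training never started.
    last = get_last_checkpoint(output_dir)
    trainer.train(resume_from_checkpoint=last)
```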
`rag_pipeline.py` serves an interactive Retrieval‑Augmented Generation chatbot:

- Loads ChromaDB (persisted vector store)
- Uses Sentence‑Transformers embeddings
- Instantiates a LangChain `RetrievalQA` chain
- Wraps your fine‑tuned LLaMA in a `HuggingFacePipeline`
- Provides a REPL chat loop

Main components:

- Configuration (paths, model names, device)
- Retriever instantiation
- LLM loading & pipeline
- `RetrievalQA.from_chain_type(...)`
- `chat()` loop
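A minimal sketch of how these pieces fit together, assuming classic LangChain import paths; the persist directory, model path, and generation settings are placeholders:

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Retriever over the persisted Chroma store (path is a placeholder).
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
store = Chroma(persist_directory="chroma_db", embedding_function=embeddings)
retriever = store.as_retriever(search_kwargs={"k": 4})

# Fine-tuned LLaMA wrapped as a LangChain LLM (model path is a placeholder).
tok = AutoTokenizer.from_pretrained("path/to/finetuned-llama")
model = AutoModelForCausalLM.from_pretrained("path/to/finetuned-llama")
llm = HuggingFacePipeline(pipeline=pipeline(
    "text-generation", model=model, tokenizer=tok, max_new_tokens=256))

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                 retriever=retriever)

def chat():
    # Simple REPL: type "exit" or "quit" to stop.
    while True:
        q = input("You: ").strip()
        if q.lower() in {"exit", "quit"}:
            break
        print("Bot:", qa.run(q))
```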
`test_rag.py` executes a predefined list of questions against your RAG pipeline and logs results:

- `TEST_QUESTIONS` array
- `log_entry()` writes to CSV
- `if __name__ == "__main__":` iterates, runs `qa.run(...)`, and logs
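A minimal sketch of that loop, assuming the chain is importable from `rag_pipeline` as `qa`; the questions and log path are placeholders:

```python
import csv
from datetime import datetime

# Placeholder questions; the real list lives in test_rag.py.
TEST_QUESTIONS = [
    "What problem does this paper address?",
    "Which datasets were used?",
]

def log_entry(path, question, answer):
    # Append one (timestamp, question, answer) row per question.
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([datetime.now().isoformat(), question, answer])

if __name__ == "__main__":
    from rag_pipeline import qa  # assumed export
    for q in TEST_QUESTIONS:
        log_entry("rag_test_log.csv", q, qa.run(q))
```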
`migrate_sqlite_to_chromadb.py` performs a one‑time migration of your SQLite tables into ChromaDB:

- Fetches `works` and `research_info`
- Chunks long texts with `RecursiveCharacterTextSplitter`
- Embeds with Sentence‑Transformers (`all-MiniLM-L6-v2`)
- Adds documents to the Chroma collection and persists
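A minimal sketch of the `works` half of that migration, assuming a `full_text` column and the current chromadb client API; the DB path and collection name are placeholders (`PersistentClient` persists automatically):

```python
import sqlite3
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from langchain.text_splitter import RecursiveCharacterTextSplitter

conn = sqlite3.connect("papers.db")  # placeholder path
rows = conn.execute("SELECT id, full_text FROM works").fetchall()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(
    name="papers",
    embedding_function=SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"),
)

for work_id, text in rows:
    chunks = splitter.split_text(text or "")
    if not chunks:
        continue
    collection.add(
        documents=chunks,
        ids=[f"work-{work_id}-{i}" for i in range(len(chunks))],
        metadatas=[{"work_id": work_id}] * len(chunks),
    )
```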
`pdfs.py` downloads PDFs from a CSV. `pdf_pre.py` provides:

- `extract_text_from_pdf(file_path)`
- `clean_text(text)`
- `extract_research_info_from_pdf(file_path)`
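A minimal sketch of the first two helpers, assuming `pypdf` for extraction; the cleaning rules shown are illustrative, not necessarily the project's:

```python
import re
from pypdf import PdfReader

def extract_text_from_pdf(file_path):
    # Concatenate the extracted text of every page.
    reader = PdfReader(file_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def clean_text(text):
    # Drop non-printable characters, then collapse runs of whitespace.
    text = re.sub(r"[^\x20-\x7E\n]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```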
`model.py` provides T5 summarization & fine‑tuning helpers:

- `summarize_text(text, idx=None, total=None)`
- `fine_tune_t5_on_papers(dataset, output_dir)`
Supports resuming from the latest checkpoint.
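A minimal sketch of `summarize_text`, assuming a `t5-small` checkpoint; the generation parameters are illustrative:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def summarize_text(text, idx=None, total=None):
    if idx is not None and total is not None:
        print(f"Summarizing {idx}/{total}")  # progress reporting
    inputs = tokenizer("summarize: " + text, return_tensors="pt",
                       max_length=512, truncation=True)
    ids = model.generate(**inputs, max_length=150,
                         num_beams=4, early_stopping=True)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```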
`llama_model.py` provides LLaMA fine‑tuning helpers:

- `fine_tune_llama_on_papers(dataset, output_dir)`
  - Masks prompt tokens, computes loss only on summary tokens
  - Resumes from checkpoint
- `clear_memory()`
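A minimal sketch of the prompt‑masking step, assuming a causal‑LM tokenizer; the function and field names are placeholders. Labels of `-100` are ignored by PyTorch's cross‑entropy loss, so only summary tokens contribute:

```python
def build_masked_example(tokenizer, prompt, summary, max_length=1024):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(summary, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + target_ids)[:max_length]
    # -100 masks the prompt positions out of the loss computation.
    labels = ([-100] * len(prompt_ids) + target_ids)[:max_length]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }
```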
`database_handler.py` provides CRUD operations for SQLite:

- `setup_database()`, `setup_research_info_table()`
- `insert_work(...)`, `remove_duplicates()`, `fetch_unsummarized_works()`
- `update_summary(work_id, summary)`
- `insert_research_info(...)`, `fetch_research_info()`
- `count_entries_in_table()`, `check_missing_files_in_db(...)`
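A minimal sketch of a few of these helpers, assuming a simple `works` schema; the DB path and column names are placeholders inferred from the function names:

```python
import sqlite3

DB_PATH = "papers.db"  # placeholder

def setup_database():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("""CREATE TABLE IF NOT EXISTS works (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            full_text TEXT,
            summary TEXT)""")

def fetch_unsummarized_works():
    with sqlite3.connect(DB_PATH) as conn:
        return conn.execute(
            "SELECT id, full_text FROM works WHERE summary IS NULL"
        ).fetchall()

def update_summary(work_id, summary):
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("UPDATE works SET summary = ? WHERE id = ?",
                     (summary, work_id))
```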
`data_pre.py` exposes a single preprocessing helper:

- `preprocess_text_for_t5(text, model_name="t5-small")`
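A minimal sketch of what this helper plausibly does (prepend the T5 task prefix and truncate to the model's input budget); the token limit and decode‑back behavior are assumptions:

```python
from transformers import T5Tokenizer

def preprocess_text_for_t5(text, model_name="t5-small"):
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    # Prefix the summarization task tag and truncate to 512 tokens.
    ids = tokenizer("summarize: " + text,
                    max_length=512, truncation=True)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)
```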
```mermaid
flowchart LR
  subgraph Training Pipeline
    A[train_pipeline.py]
  end
  subgraph RAG Service
    B[rag_pipeline.py]
    C[test_rag.py]
  end
  subgraph Helpers
    D[migrate_sqlite_to_chromadb.py]
    E[pdfs.py] & F[pdf_pre.py]
    G[model.py] & H[llama_model.py]
    I[database_handler.py] & J[data_pre.py]
  end
  A --> I
  A --> E
  A --> F
  A --> G
  A --> H
  B --> D
  B --> G
  B --> H
  C --> B
```