Pipeline Overview

This project implements a full RAG‑style research assistant:

train_pipeline.py — end‑to‑end training:
- PDF ingestion
- SQLite staging
- T5 summarization & fine‑tuning
- “Lite” dataset creation
- LLaMA fine‑tuning
rag_pipeline.py — interactive RAG chatbot:
- ChromaDB vector store
- LangChain RetrieverQA
- Fine‑tuned LLaMA generation
test_rag.py — automated test suite for your RAG chatbot.

All scripts assume your code lives under C:\codes\….

Folder Structure

pipeline_project/

├── train_pipeline.py

├── rag_pipeline.py

├── test_rag.py

├── migrate_sqlite_to_chromadb.py

├── pdfs.py

├── pdf_pre.py

├── model.py

├── llama_model.py

├── database_handler.py

├── data_pre.py

└── (optional helper scripts)

train_pipeline.py
Orchestrates data ingestion, summarization, and model fine‑tuning in one shot.
rag_pipeline.py
Loads ChromaDB + LangChain + LLaMA to serve an interactive chatbot.
test_rag.py
Runs a list of test questions through your RAG chain and logs outputs.
migrate_sqlite_to_chromadb.py
One‑time migration of all SQLite data into ChromaDB.
pdfs.py / pdf_pre.py
Download, extract, and clean PDF text.
model.py
T5 summarization & fine‑tuning utilities.
llama_model.py
LLaMA fine‑tuning utilities.
database_handler.py
SQLite schema & CRUD helpers.
data_pre.py
Text‑to‑T5 preprocessing helper.

Detailed File and Function Documentation

1. train_pipeline.py

Purpose

Runs the full data → model training loop:

Download PDFs (from your merged CSV)
Extract & ingest into SQLite
Summarize with T5 and update DB
Fine‑tune T5 on (full_text → summary)
Create lite DB/pickle/CSV
Fine‑tune LLaMA on (input_text → target_text)

Key Functions

download_pdfs()
process_pdfs_into_sqlite()
generate_summaries_and_finetune_t5()
create_lite_and_finetune_llama()

Each step is checkpoint‑aware and resumes from the latest checkpoint.

2. rag_pipeline.py

Purpose

Serves an interactive Retrieval‑Augmented Generation chatbot:

Loads ChromaDB (persisted vector store)
Uses Sentence‑Transformers embeddings
Instantiates a LangChain RetrievalQA chain
Wraps your fine‑tuned LLaMA in a HuggingFacePipeline
Provides a REPL chat loop

Key Sections

Configuration (paths, model names, device)
Retriever instantiation
LLM loading & pipeline
RetrievalQA.from_chain_type(...)
chat() loop

3. test_rag.py

Purpose

Executes a predefined list of questions against your RAG pipeline and logs results:

TEST_QUESTIONS array
log_entry() writes to CSV
if __name__ == "__main__": iterates, runs qa.run(...), logs

4. migrate_sqlite_to_chromadb.py

Purpose

One‑time migration of your SQLite tables into ChromaDB:

Fetches works and research_info
Chunks long texts with RecursiveCharacterTextSplitter
Embeds with Sentence‑Transformers (all-MiniLM-L6-v2)
Adds documents to Chroma collection and persists

5. pdfs.py & pdf_pre.py

pdfs.py: Downloads PDFs from a CSV.
pdf_pre.py:
- extract_text_from_pdf(file_path)
- clean_text(text)
- extract_research_info_from_pdf(file_path)

6. model.py

Purpose

T5 summarization & fine‑tuning helpers:

summarize_text(text, idx=None, total=None)
fine_tune_t5_on_papers(dataset, output_dir)

Supports resuming from the latest checkpoint.

7. llama_model.py

Purpose

LLaMA fine‑tuning helpers:

fine_tune_llama_on_papers(dataset, output_dir)
- Masks prompt tokens, computes loss only on summary tokens
- Resumes from checkpoint
clear_memory()

8. database_handler.py

CRUD operations for SQLite:

setup_database(), setup_research_info_table()
insert_work(...), remove_duplicates(), fetch_unsummarized_works()
update_summary(work_id, summary)
insert_research_info(...), fetch_research_info()
count_entries_in_table(), check_missing_files_in_db(...)

9. data_pre.py

preprocess_text_for_t5(text, model_name="t5-small")

Dependency Diagram

train_pipeline.py
Orchestrates data ingestion, summarization, and model fine‑tuning in one shot.
rag_pipeline.py
Loads ChromaDB + LangChain + LLaMA to serve an interactive chatbot.
test_rag.py
Runs a list of test questions through your RAG chain and logs outputs.
migrate_sqlite_to_chromadb.py
One‑time migration of all SQLite data into ChromaDB.
pdfs.py / pdf_pre.py
Download, extract, and clean PDF text.
model.py
T5 summarization & fine‑tuning utilities.
llama_model.py
LLaMA fine‑tuning utilities.
database_handler.py
SQLite schema & CRUD helpers.
data_pre.py
Text‑to‑T5 preprocessing helper.

Dependency & Pipeline Flowchart

flowchart LR
    subgraph Training Pipeline
      A[train_pipeline.py]
    end

    subgraph RAG Service
      B[rag_pipeline.py]
      C[test_rag.py]
    end

    subgraph Helpers
      D[migrate_sqlite_to_chromadb.py]
      E[pdfs.py] & F[pdf_pre.py]
      G[model.py] & H[llama_model.py]
      I[database_handler.py] & J[data_pre.py]
    end

    A --> I
    A --> E
    A --> F
    A --> G
    A --> H

    B --> D
    B --> G
    B --> H

    C --> B

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
Chatbot_run.md		Chatbot_run.md
README.md		README.md
clean_db.py		clean_db.py
csv_handler.py		csv_handler.py
database_handler.py		database_handler.py
fine_tune_llama_rag.py		fine_tune_llama_rag.py
ingest_pdf_fulltext.py		ingest_pdf_fulltext.py
llama_data_formatter.py		llama_data_formatter.py
migrate_to_chromadb.py		migrate_to_chromadb.py
model.py		model.py
pdf_pre.py		pdf_pre.py
pdfs.py		pdfs.py
run_pipeline.py		run_pipeline.py
summarize_works.py		summarize_works.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Pipeline Overview

Folder Structure

Detailed File and Function Documentation

1. train_pipeline.py

Purpose

Key Functions

2. rag_pipeline.py

Purpose

Key Sections

3. test_rag.py

Purpose

4. migrate_sqlite_to_chromadb.py

Purpose

5. pdfs.py & pdf_pre.py

6. model.py

Purpose

7. llama_model.py

Purpose

8. database_handler.py

9. data_pre.py

Dependency Diagram

Dependency & Pipeline Flowchart

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

SyracuseUniversity/research-llm

Folders and files

Latest commit

History

Repository files navigation

Pipeline Overview

Folder Structure

Detailed File and Function Documentation

1. train_pipeline.py

Purpose

Key Functions

2. rag_pipeline.py

Purpose

Key Sections

3. test_rag.py

Purpose

4. migrate_sqlite_to_chromadb.py

Purpose

5. pdfs.py & pdf_pre.py

6. model.py

Purpose

7. llama_model.py

Purpose

8. database_handler.py

9. data_pre.py

Dependency Diagram

Dependency & Pipeline Flowchart

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages