End-to-end tooling for Persian legal language modeling: OCR for books, scrapers for legal Q&A sites, RAG corpus building, LoRA fine-tuning, and lightweight clients/apps for local inference.
- `src/ocr`: Persian-optimized OCR pipeline (Tesseract-based) with preprocessing and normalization.
- `src/scraper`: Site-specific scrapers to collect legal Q&A data (listing + single-page crawlers).
- `src/rag`: RAG corpus builder that embeds normalized data with `BAAI/bge-m3` and persists to Chroma.
- `src/train`: LoRA fine-tuning stack for causal LMs (CUDA/MPS aware) plus merge/test helpers.
- `src/adapter`: LM Studio client for interacting with local REST servers.
- `src/app`: Minimal Django + vanilla JS chat app for llama.cpp-backed local models.
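To make the "normalization" step of the OCR module more concrete, here is a minimal sketch of the kind of cleanup Persian OCR output commonly needs. It is not taken from `src/ocr`; the function name `normalize_persian` and the exact set of rules are assumptions chosen for illustration.

```python
import re
import unicodedata

# Arabic code points that OCR engines often emit where Persian text
# expects a different letter. This mapping is an illustrative assumption,
# not the repository's actual rule set.
_CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
    "\u0640": None,      # tatweel (kashida), purely decorative
})

# Arabic-Indic digits (U+0660..U+0669) -> Persian digits (U+06F0..U+06F9)
_DIGIT_MAP = {0x0660 + i: 0x06F0 + i for i in range(10)}


def normalize_persian(text: str) -> str:
    """Unify letter and digit variants and collapse runs of spaces/tabs."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_CHAR_MAP).translate(_DIGIT_MAP)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Unifying these variants before embedding or training keeps visually identical words from being treated as different tokens.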
- Create a virtual environment and install module-specific deps as needed:

      python -m venv .venv
      source .venv/bin/activate
      pip install -r src/ocr/requirements.txt      # OCR
      pip install -r src/scraper/requirements.txt  # Scrapers

- OCR a PDF (see `src/ocr/README.md` for more):

      python src/ocr/ocr.py --input path/to/book.pdf --output out.json

- Build a RAG vector store:

      python src/rag/rag.py \
        --data-path data/normalized_data/final_books.json \
        --persist-dir ./persian_rag_db

- Train a LoRA adapter (defaults in `config.LoRAConfig`):

      cd src/train
      python train.py

- Test a merged model (after training/merging):

      cd src/train/scripts
      python test.py

- Scrape legal Q&A data (examples vary by site; see individual READMEs under `src/scraper/*`).
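The vector-store step in the list above can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of `src/rag/rag.py`: it assumes the normalized JSON is a list of objects with a `text` field, and the names `to_records`, `build_store`, and the `legal_docs` collection are hypothetical. It relies on the `sentence-transformers` and `chromadb` packages being installed.

```python
import json
from pathlib import Path


def to_records(items):
    """Keep non-empty texts and give each one a stable id."""
    records = []
    for i, item in enumerate(items):
        text = str(item.get("text", "")).strip()  # "text" is an assumed field name
        if text:
            records.append({"id": f"doc-{i}", "text": text})
    return records


def build_store(data_path, persist_dir, collection_name="legal_docs", batch_size=32):
    """Embed every record with BAAI/bge-m3 and persist it to a Chroma directory."""
    # Third-party imports stay local so to_records works without them.
    import chromadb
    from sentence_transformers import SentenceTransformer

    items = json.loads(Path(data_path).read_text(encoding="utf-8"))
    records = to_records(items)

    model = SentenceTransformer("BAAI/bge-m3")
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection(name=collection_name)

    # Embed and insert in batches to bound memory use.
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        texts = [r["text"] for r in batch]
        vectors = model.encode(texts, normalize_embeddings=True)
        collection.add(
            ids=[r["id"] for r in batch],
            documents=texts,
            embeddings=vectors.tolist(),
        )
    return collection
```

Calling `build_store("data/normalized_data/final_books.json", "./persian_rag_db")` would mirror the CLI invocation shown earlier, assuming the data file follows the layout described here.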
src/
  adapter/   # LM Studio client + console app
  app/       # Django backend + static frontend for llama.cpp chat
  ocr/       # Persian legal OCR pipeline
  rag/       # RAG document builder + demo query script
  scraper/   # Site-specific scrapers (listing/single-page)
  train/     # LoRA training, merging, and smoke tests
data/        # Normalized datasets (not included in repo by default)
- Hardware: training and embedding use CUDA or Apple MPS when available and fall back to CPU otherwise.
- Data: paths in configs default to local normalized JSON files; override them via CLI flags or `LoRAConfig`.
- Each subdirectory has its own README with deeper instructions and parameters.
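The hardware fallback and the "defaults plus CLI override" pattern from the notes above can be sketched in a few lines. Everything here is illustrative: `SketchConfig` is a hypothetical stand-in for `LoRAConfig` (whose real fields are not shown in this README), and the `--device` flag is an assumption; only `--data-path` appears in the commands above.

```python
import argparse
from dataclasses import dataclass


@dataclass
class SketchConfig:
    # Hypothetical stand-in for LoRAConfig; the real field names may differ.
    data_path: str = "data/normalized_data/final_books.json"
    device: str = "auto"


def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch what is available; report CPU if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    return pick_device(torch.cuda.is_available(), bool(mps and mps.is_available()))


def parse_config(argv=None) -> SketchConfig:
    """Start from the dataclass defaults and let CLI flags override them."""
    defaults = SketchConfig()
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-path", default=defaults.data_path)
    parser.add_argument("--device", default=defaults.device,
                        choices=["auto", "cuda", "mps", "cpu"])
    args = parser.parse_args(argv)
    device = detect_device() if args.device == "auto" else args.device
    return SketchConfig(data_path=args.data_path, device=device)
```

For example, `parse_config(["--data-path", "my_books.json"])` keeps automatic device detection while pointing the run at a different dataset.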