uidops/DIVAN-LM

DIVAN-LM

End-to-end tooling for Persian legal language modeling: OCR for books, scrapers for legal Q&A sites, RAG corpus building, LoRA fine-tuning, and lightweight clients/apps for local inference.

What's Inside

  • src/ocr: Persian-optimized OCR pipeline (Tesseract-based) with preprocessing and normalization.
  • src/scraper: Site-specific scrapers to collect legal Q&A data (listing + single-page crawlers).
  • src/rag: RAG corpus builder that embeds normalized data with BAAI/bge-m3 and persists to Chroma.
  • src/train: LoRA fine-tuning stack for causal LMs (CUDA/MPS aware) plus merge/test helpers.
  • src/adapter: LM Studio client for interacting with local REST servers.
  • src/app: Minimal Django + vanilla JS chat app for llama.cpp-backed local models.
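The normalization step mentioned for src/ocr is not spelled out in this README. As a rough illustration only, the sketch below shows one common kind of Persian text normalization: mapping Arabic yeh and kaf to their Persian forms, dropping tatweel, converting Arabic-Indic digits to Persian digits, and collapsing whitespace. The function name normalize_persian and the exact rule set are assumptions and may not match what the repository's pipeline does.

```python
import re
import unicodedata

# Assumed rule set (not taken from the repository): Arabic code points that
# often show up in OCR'd Persian text, mapped to their Persian counterparts.
_CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
    "\u0640": None,      # tatweel (kashida) is dropped
})

# Arabic-Indic digits (U+0660..U+0669) -> Persian digits (U+06F0..U+06F9).
_DIGIT_MAP = {0x0660 + i: 0x06F0 + i for i in range(10)}


def normalize_persian(text: str) -> str:
    """Hypothetical helper: apply a small, conservative set of normalizations."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_CHAR_MAP).translate(_DIGIT_MAP)
    # Collapse runs of spaces and tabs, then trim the ends.
    return re.sub(r"[ \t]+", " ", text).strip()
```

Applying the same function to OCR output and to scraped Q&A text would keep both sources consistent before embedding or training; whether the repository shares one normalizer across modules is not stated here.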

Quick Start

  • Create a virtual environment and install module-specific deps as needed:
    python -m venv .venv
    source .venv/bin/activate
    pip install -r src/ocr/requirements.txt         # OCR
    pip install -r src/scraper/requirements.txt     # Scrapers
  • OCR a PDF (see src/ocr/README.md for more):
    python src/ocr/ocr.py --input path/to/book.pdf --output out.json
  • Build a RAG vector store:
    python src/rag/rag.py \
      --data-path data/normalized_data/final_books.json \
      --persist-dir ./persian_rag_db
  • Train a LoRA adapter (defaults in config.LoRAConfig):
    cd src/train
    python train.py
  • Test a merged model (after training/merging):
    cd src/train/scripts
    python test.py
  • Scrape legal Q&A data (examples vary by site; see individual READMEs under src/scraper/*).
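For orientation, the RAG build step can be pictured roughly as follows: split each document into chunks, embed the chunks with BAAI/bge-m3, and write them to a persistent Chroma collection. This is a minimal sketch, not the contents of src/rag/rag.py. The helper names chunk_text and build_store, the chunk sizes, and the collection name legal_docs are all hypothetical, and the input is assumed to be a plain list of strings because the schema of the normalized JSON is not described here. It relies on the sentence-transformers and chromadb packages.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> List[str]:
    """Split text into overlapping character windows (sizes are assumptions)."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + max_chars].strip()
        if piece:
            chunks.append(piece)
        # Stop once the current window already reaches the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


def build_store(texts: List[str], persist_dir: str,
                collection_name: str = "legal_docs") -> int:
    """Embed chunks and persist them to Chroma; returns the number of chunks."""
    # Imported lazily so chunk_text stays usable without the heavy dependencies.
    import chromadb
    from sentence_transformers import SentenceTransformer

    chunks = [c for t in texts for c in chunk_text(t)]
    if not chunks:
        return 0

    model = SentenceTransformer("BAAI/bge-m3")
    embeddings = model.encode(chunks, normalize_embeddings=True).tolist()

    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection(collection_name)
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
    )
    return len(chunks)
```

A call such as build_store(list_of_texts, "./persian_rag_db") would mirror the --persist-dir flag shown above. Character-based chunking is used only for simplicity; a sentence-aware or token-aware splitter may suit legal text better.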

Project Layout

src/
  adapter/    # LM Studio client + console app
  app/        # Django backend + static frontend for llama.cpp chat
  ocr/        # Persian legal OCR pipeline
  rag/        # RAG document builder + demo query script
  scraper/    # Site-specific scrapers (listing/single-page)
  train/      # LoRA training, merging, and smoke tests
data/         # Normalized datasets (not included in repo by default)

Notes

  • Hardware: training and embedding use CUDA or Apple MPS when available and fall back to CPU otherwise.
  • Data: paths in configs default to local normalized JSON files; override them via CLI flags or LoRAConfig.
  • Each subdirectory has its own README with deeper instructions and parameters.
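The hardware fallback described in the notes above usually comes down to a short check like the one below. This is a sketch under assumptions: pick_device is a hypothetical name, and the repository's own detection logic may differ. It prefers CUDA, then Apple MPS, then CPU, and also returns "cpu" when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what is available."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate, so report CPU.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # torch.backends.mps is missing on older PyTorch builds, so probe defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"
```

The returned string can be passed to calls that accept a device argument, for example model.to(pick_device()) in PyTorch or SentenceTransformer(..., device=pick_device()).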

About

DIVAN large language model
