
ScaleLLM: A Technique for Scalable LLM-augmented Data Systems

Large language models (LLMs) offer powerful semantic insights for data analytics, but row-by-row LLM calls quickly become prohibitively expensive on large datasets. ScaleLLM is a system that substantially reduces both the latency and cost of text classification by coupling LLM-generated labels on a small subset of the data with a lightweight machine learning model for large-scale inference.

This approach yields speed-ups of up to 37× while maintaining accuracy close to that of a full LLM baseline, converging to within 1% of its accuracy on several tasks. ScaleLLM also produces cost-accuracy projections, giving users fine-grained control over the balance between spend and quality.

Features

  • Efficient Inference: Up to 37× speed-up compared to full LLM baselines
  • Cost Optimization: Significant reduction in API costs while maintaining accuracy
  • Embedding Views: Reusable embedding representations for efficient querying
  • Web Interface: Visual UI for exploring and analyzing results
  • Multiple Datasets: Support for various text classification tasks including:
    • Yelp restaurant reviews classification
    • Yahoo Answers classification
    • Hate speech detection
    • Offensive tweets classification
    • MTOP dataset processing

Prerequisites

  • Python 3.10
  • PostgreSQL with pgvector extension
  • Node.js (for frontend)

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd scalellm
  2. Install Python dependencies:

    pip install -r requirements.txt
  3. Set up environment variables: Create a .env file in the root directory with your configuration:

    # Database configuration
    DATABASE_URL=postgresql://postgres:postgres@localhost:5432/dev
    
    # OpenAI API key (required for LLM operations)
    OPENAI_API_KEY=your_openai_api_key_here

Running Instructions

1. Start PostgreSQL Instance

Option A: Using Docker Compose (Recommended)

docker-compose up -d postgres

This will start PostgreSQL with the pgvector extension on port 5432.

Option B: Local PostgreSQL Installation

  • Install PostgreSQL and the pgvector extension
  • Create a database named dev
  • Ensure the database is accessible at localhost:5432
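For a manual setup, the required SQL looks roughly like this (assuming a superuser `postgres` account; adjust the database name to match the `DATABASE_URL` in your `.env`):

    -- create the database expected by the default DATABASE_URL
    CREATE DATABASE dev;

    -- then, connected to the dev database, enable pgvector
    CREATE EXTENSION IF NOT EXISTS vector;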

2. Run Data Loader Scripts

Load your desired dataset using the available dataloaders:

# Yelp dataset (restaurant reviews)
python dataloaders/load_yelp_dataset.py

# Yahoo Answers classification
python dataloaders/yahoo_answer_classification.py

# Hate speech detection
python dataloaders/hate_speech_dataloader.py

# Offensive tweets classification
python dataloaders/offensive_tweets_dataset.py

# MTOP dataset
python dataloaders/mtop_dataset.py

3. Run Main Application

Execute the main ScaleLLM pipeline:

cd src
python main.py

This will:

  • Install the pgvector extension
  • Set up metadata tables
  • Run the classification pipeline on the loaded data
  • Generate embeddings and perform inference
  • Clean up temporary data

4. Launch Web Applications

Backend API:

cd src/webapp/backend
uvicorn app:app --reload --port 8000

Frontend UI:

cd src/webapp/frontend
npm install
npm run dev

The web interface will be available at http://localhost:5173 (or the port shown in the terminal).

Project Structure

scalellm/
├── src/
│   ├── main.py                 # Main application entry point
│   ├── embeddings.py           # Embedding generation and management
│   ├── generations.py          # LLM generation utilities
│   ├── webapp/                 # Web application
│   │   ├── backend/            # FastAPI backend
│   │   └── frontend/           # React frontend
│   └── models/                 # ML models and utilities
├── dataloaders/                # Dataset loading scripts
├── requirements.txt            # Python dependencies
├── docker-compose.yml          # Docker setup for PostgreSQL
└── README.md                   # This file

Usage Examples

Text Classification

ScaleLLM can classify restaurant reviews by cuisine type, detect hate speech, classify Yahoo Answers, and more. The system automatically:

  1. Generates embeddings for text data
  2. Creates a small labeled subset using LLM calls
  3. Trains a lightweight classifier
  4. Performs efficient inference on the full dataset
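The four steps above can be sketched as follows. This is a minimal illustration, not the actual ScaleLLM implementation: the `embed` and `llm_label` functions are hypothetical stand-ins for a real embedding model and real LLM calls, and the nearest-centroid classifier is just one possible lightweight model.

```python
import numpy as np

def llm_label(texts):
    # Stand-in for an expensive LLM labeling call (hypothetical helper):
    # label 1 if the review mentions "great", else 0.
    return [1 if "great" in t else 0 for t in texts]

def embed(texts):
    # Stand-in for a real embedding model: a tiny bag-of-words vector.
    vocab = ["great", "terrible"]
    return np.array([[t.count(w) for w in vocab] for t in texts], dtype=float)

def scalellm_classify(texts, budget):
    X = embed(texts)                     # 1. embed every row once
    sample = np.arange(min(budget, len(texts)))  # 2. pick a small subset...
    y_sample = np.array(llm_label([texts[i] for i in sample]))  # ...and LLM-label it
    # 3. train a lightweight model: nearest class centroid in embedding space
    classes = sorted(set(y_sample.tolist()))
    centroids = {c: X[sample][y_sample == c].mean(axis=0) for c in classes}
    # 4. cheap inference over the full dataset, no further LLM calls
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes])
    return [classes[j] for j in dists.argmin(axis=0)]

reviews = ["great food", "terrible service", "great service", "terrible food"]
print(scalellm_classify(reviews, budget=2))  # → [1, 0, 1, 0]
```

Only the first `budget` rows ever touch the LLM; everything else is classified by the cheap model, which is where the latency and cost savings come from.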

Cost-Accuracy Trade-offs

The system provides projections for different cost-accuracy trade-offs, allowing users to choose the optimal balance for their use case.
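As a rough illustration of the kind of projection involved (the per-call price and the accuracy curve below are made-up numbers for the sketch, not figures from the paper):

```python
# Hypothetical projection: labeling cost grows linearly with the sample size,
# while accuracy (a toy diminishing-returns curve here) flattens out, so most
# of the benefit comes from a small labeled subset.
def project(n_labeled, cost_per_llm_call=0.002):
    cost = n_labeled * cost_per_llm_call              # USD spent on LLM labels
    accuracy = 0.95 * (1 - 1 / (1 + n_labeled / 50))  # saturates toward 0.95
    return cost, accuracy

for n in (100, 500, 2000):
    cost, acc = project(n)
    print(f"{n:>5} labels: ${cost:.2f}, projected accuracy {acc:.3f}")
```

A user can scan such a curve and pick the smallest labeling budget whose projected accuracy meets their requirement.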

Citation

If you use ScaleLLM in your research, please cite:

@inproceedings{alaparthi2025scalellm,
  title={ScaleLLM: A Technique for Scalable LLM-augmented Data Systems},
  author={Alaparthi, Ashwin and Loh, Paul and Marcus, Ryan},
  booktitle={Companion of the 2025 International Conference on Management of Data (SIGMOD-Companion '25)},
  pages={1--4},
  year={2025},
  organization={ACM},
  doi={10.1145/3722212.3725130}
}

Coming Soon

A second part of this work focusing on constrained LLMs is coming soon! This extension will explore techniques for incorporating domain-specific constraints and business rules into the LLM-augmented data processing pipeline.

Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This work was supported by research grants and computing resources from our institutions. We thank the open-source community for the tools and libraries that made this project possible.
