Large language models (LLMs) offer powerful semantic insights for data analytics, but row-by-row LLM calls quickly become prohibitively expensive on large datasets. ScaleLLM is a novel system that substantially reduces both latency and cost on text classification tasks by coupling LLM-generated labels on a small subset of the data with a lightweight machine learning model for large-scale inference.
This approach yields significant speed-ups (up to 37×) while maintaining accuracy close to that of a full LLM baseline, coming within 1% of its accuracy on several tasks. ScaleLLM also provides cost-accuracy projections, giving users fine-grained control over the balance between expense and quality.
- Efficient Inference: Up to 37× speed-up compared to full LLM baselines
- Cost Optimization: Significant reduction in API costs while maintaining accuracy
- Embedding Views: Reusable embedding representations for efficient querying (see the query sketch after this list)
- Web Interface: Visual UI for exploring and analyzing results
- Multiple Datasets: Support for various text classification tasks, including:
  - Yelp restaurant reviews classification
  - Yahoo Answers classification
  - Hate speech detection
  - Offensive tweets classification
  - MTOP dataset processing
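To give a flavor of what embedding views enable, here is a minimal similarity query against pgvector from Python. This is a sketch only: the table and column names (`review_embeddings`, `embedding`, `text`) are hypothetical, not the project's actual schema.

```python
# Hypothetical similarity query over a stored embedding view with pgvector.
# Table and column names are illustrative only.
import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    query_vec = "[0.12, -0.03, 0.41]"  # an embedding serialized as a pgvector literal
    cur.execute(
        """
        SELECT text, embedding <=> %s::vector AS cosine_distance
        FROM review_embeddings
        ORDER BY embedding <=> %s::vector
        LIMIT 5
        """,
        (query_vec, query_vec),
    )
    for text, dist in cur.fetchall():
        print(f"{dist:.3f}  {text[:60]}")
```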
- Python 3.10
- PostgreSQL with pgvector extension
- Node.js (for frontend)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd scalellm
  ```
- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables: create a `.env` file in the root directory with your configuration:

  ```bash
  # Database configuration
  DATABASE_URL=postgresql://postgres:postgres@localhost:5432/dev

  # OpenAI API key (required for LLM operations)
  OPENAI_API_KEY=your_openai_api_key_here
  ```
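For reference, application code would typically read these values as shown below. This is a sketch assuming python-dotenv is available; the project may load its configuration differently.

```python
# Sketch: loading the .env configuration (assumes python-dotenv is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

DATABASE_URL = os.environ["DATABASE_URL"]      # raises KeyError if missing
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
```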
Option A: Using Docker Compose (Recommended)
```bash
docker-compose up -d postgres
```

This will start PostgreSQL with the pgvector extension on port 5432.
Option B: Local PostgreSQL Installation
- Install PostgreSQL and the pgvector extension
- Create a database named `dev`
- Ensure the database is accessible at `localhost:5432`
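Either way, you can sanity-check the connection and the extension before running the pipeline. A minimal check, assuming psycopg2 is installed:

```python
# Sketch: verify the database is reachable and pgvector is available.
import psycopg2

conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/dev")
with conn, conn.cursor() as cur:
    cur.execute("SELECT name FROM pg_available_extensions WHERE name = 'vector'")
    print("pgvector available:", cur.fetchone() is not None)
conn.close()
```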
Load your desired dataset using the available dataloaders:
```bash
# Yelp dataset (restaurant reviews)
python dataloaders/load_yelp_dataset.py

# Yahoo Answers classification
python dataloaders/yahoo_answer_classification.py

# Hate speech detection
python dataloaders/hate_speech_dataloader.py

# Offensive tweets classification
python dataloaders/offensive_tweets_dataset.py

# MTOP dataset
python dataloaders/mtop_dataset.py
```
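Each loader boils down to the same load-and-insert pattern. The sketch below is illustrative only; the real scripts in `dataloaders/` differ in source format and schema, and the `documents` table name is hypothetical.

```python
# Sketch of the general dataloader pattern (not the actual scripts).
import psycopg2

rows = [
    ("The tacos were incredible.", "mexican"),   # toy examples
    ("Best pad thai in town.", "thai"),
]

conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/dev")
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(id SERIAL PRIMARY KEY, text TEXT, label TEXT)"
    )
    cur.executemany("INSERT INTO documents (text, label) VALUES (%s, %s)", rows)
conn.close()
```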
Execute the main ScaleLLM pipeline:

```bash
cd src
python main.py
```

This will:
- Install the pgvector extension
- Set up metadata tables
- Run the classification pipeline on the loaded data
- Generate embeddings and perform inference
- Clean up temporary data
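The first two of these steps might look roughly like the following. This is a sketch: `main.py`'s actual implementation may differ, and the `scalellm_runs` table name is hypothetical.

```python
# Sketch: enable pgvector and create a metadata table.
import psycopg2

conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/dev")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")  # install pgvector
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS scalellm_runs (
            id SERIAL PRIMARY KEY,
            dataset TEXT,
            started_at TIMESTAMP DEFAULT now()
        )
        """
    )
conn.close()
```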
Backend API:

```bash
cd src/webapp/backend
uvicorn app:app --reload --port 8000
```

Frontend UI:
```bash
cd src/webapp/frontend
npm install
npm run dev
```

The web interface will be available at http://localhost:5173 (or the port shown in the terminal).
```
scalellm/
├── src/
│   ├── main.py            # Main application entry point
│   ├── embeddings.py      # Embedding generation and management
│   ├── generations.py     # LLM generation utilities
│   ├── webapp/            # Web application
│   │   ├── backend/       # FastAPI backend
│   │   └── frontend/      # React frontend
│   └── models/            # ML models and utilities
├── dataloaders/           # Dataset loading scripts
├── requirements.txt       # Python dependencies
├── docker-compose.yml     # Docker setup for PostgreSQL
└── README.md              # This file
```
ScaleLLM can classify restaurant reviews by cuisine type, detect hate speech, categorize Yahoo Answers questions, and more. As sketched below, the system automatically:
- Generates embeddings for text data
- Creates a small labeled subset using LLM calls
- Trains a lightweight classifier
- Performs efficient inference on the full dataset
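Condensed, the pipeline looks roughly like this. The model names, prompt, subset size, and choice of logistic regression are illustrative assumptions, and `load_texts()` is a hypothetical helper; see `src/main.py` for the real implementation.

```python
# Sketch of the ScaleLLM pipeline: embed everything, LLM-label a subset,
# train a cheap classifier, then infer over the full dataset.
import numpy as np
from openai import OpenAI
from sklearn.linear_model import LogisticRegression

client = OpenAI()                 # reads OPENAI_API_KEY from the environment
texts = load_texts()              # hypothetical helper returning all rows
LABELS = ["positive", "negative"]

# 1. Generate embeddings for every row (reusable across queries).
resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
X = np.array([d.embedding for d in resp.data])

# 2. Label a small random subset with LLM calls.
subset = np.random.choice(len(texts), size=200, replace=False)
y_subset = []
for i in subset:
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Classify as one of {LABELS}: {texts[i]}"}],
    )
    y_subset.append(out.choices[0].message.content.strip())

# 3. Train a lightweight classifier on the LLM-labeled subset.
clf = LogisticRegression(max_iter=1000).fit(X[subset], y_subset)

# 4. Efficient inference over the full dataset, with no further LLM calls.
predictions = clf.predict(X)
```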
The system provides projections for different cost-accuracy trade-offs, allowing users to choose the optimal balance for their use case.
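One simple way to realize such projections (a sketch of the idea, not the system's actual estimator) is to train on increasing LLM-label budgets and report held-out accuracy per budget, so users can pick a price point:

```python
# Sketch: accuracy as a function of the LLM-labeling budget.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def project_tradeoffs(X, y_llm, budgets=(50, 100, 200, 400), cost_per_call=0.001):
    """X: embeddings; y_llm: LLM labels for an already-labeled pool.
    cost_per_call is an assumed per-label API price, not a real quote."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y_llm, test_size=0.3,
                                                random_state=0)
    for b in budgets:
        b = min(b, len(X_tr))
        clf = LogisticRegression(max_iter=1000).fit(X_tr[:b], y_tr[:b])
        acc = clf.score(X_val, y_val)
        print(f"{b:>4} LLM calls  ~${b * cost_per_call:.2f}  accuracy={acc:.3f}")
```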
If you use ScaleLLM in your research, please cite:
```bibtex
@inproceedings{alaparthi2025scalellm,
  title={ScaleLLM: A Technique for Scalable LLM-augmented Data Systems},
  author={Alaparthi, Ashwin and Loh, Paul and Marcus, Ryan},
  booktitle={Companion of the 2025 International Conference on Management of Data (SIGMOD-Companion '25)},
  pages={1--4},
  year={2025},
  organization={ACM},
  doi={10.1145/3722212.3725130}
}
```

A second part of this work, focusing on constrained LLMs, is coming soon! This extension will explore techniques for incorporating domain-specific constraints and business rules into the LLM-augmented data processing pipeline.
We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
This work was supported by research grants and computing resources from our institutions. We thank the open-source community for the tools and libraries that made this project possible.