Clone the repository and install dependencies via uv (recommended):
git clone https://github.com/VIDA-NYU/AutoDDG.git
cd AutoDDG
uv sync
# If you do not have uv installed:
# * `curl -LsSf https://astral.sh/uv/install.sh | sh`
# * or look at https://docs.astral.sh/uv/getting-started/installation/

Then launch Jupyter Lab to explore:
uv run --with jupyter jupyter lab

Alternatively, install directly via pip:
pip install git+https://github.com/VIDA-NYU/AutoDDG@main

Caution: This installation method is temporary. A PyPI release of AutoDDG will be available soon, at which point the git+https method will be deprecated in favor of the PyPI index.
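
After either installation route, a quick check that the package is importable can save a confusing error later. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical, and the module name `autoddg` is taken from the import in the usage example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# Reports whether the current environment can import AutoDDG.
print(is_installed("autoddg"))
```

If you installed with uv, run the check inside the project environment (for example through `uv run python`), since the package may not be visible to the system interpreter.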
The simplest way to use AutoDDG is to create an instance and generate a description from a dataset sample:
from openai import OpenAI
from autoddg import AutoDDG
# Setup OpenAI client
client = OpenAI(api_key="sk-...")
# Initialize AutoDDG
autoddg = AutoDDG(client=client, model_name="gpt-4o-mini")
# Generate description from a small CSV sample
sample_csv = """Case_ID,Age,BMI
C3L-00004,72,22.8
C3L-00010,30,34.15
"""
prompt, description = autoddg.generate_description(dataset_sample=sample_csv)
print(description)
# >>> This dataset contains medical information about patients, including their unique Case_ID, Age, and Body Mass Index (BMI). etc.

For a fuller introduction, we highly recommend starting with the quick_start notebook, which works through an example dataset.
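
When the dataset lives in a file rather than an inline string, one way to build the `dataset_sample` text is to read the header and the first few rows. The sketch below uses only the standard library; `csv_sample`, its `max_rows` parameter, and the `patients.csv` path are hypothetical names for illustration and are not part of AutoDDG.

```python
import csv
import io
from itertools import islice


def csv_sample(path, max_rows=5):
    """Return the header plus the first `max_rows` data rows of a CSV file as text."""
    with open(path, newline="", encoding="utf-8") as handle:
        # Read one extra row so the header is included alongside the data rows.
        rows = list(islice(csv.reader(handle), max_rows + 1))
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Assumed usage with the `autoddg` instance created in the example above:
# sample_csv = csv_sample("patients.csv", max_rows=5)
# prompt, description = autoddg.generate_description(dataset_sample=sample_csv)
```

Taking only the first rows keeps the prompt small, but it may not be representative of a sorted or grouped file; a random sample may be a better choice in that case.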
If you use AutoDDG in your research, please cite our work:
@misc{2502.01050,
Author = {Haoxiang Zhang and Yurong Liu and Wei-Lun Hung and Aécio Santos and Juliana Freire},
Title = {AutoDDG: Automated Dataset Description Generation using Large Language Models},
Year = {2025},
Eprint = {arXiv:2502.01050},
}

AutoDDG is released under the Apache License 2.0.