Automated Dataset Description Generation using Large Language Models

AutoDDG

submitted to VLDB 2025

ArXiv Extended Paper Version

Badges: uv | Ruff | Black formatted | Python >= 3.10 | OpenAI


Installation

Clone the repository and install dependencies via uv (recommended):

git clone https://github.com/VIDA-NYU/AutoDDG.git
cd AutoDDG
uv sync
# If you do not have uv installed:
# * `curl -LsSf https://astral.sh/uv/install.sh | sh`
# * or look at https://docs.astral.sh/uv/getting-started/installation/

Then launch Jupyter Lab to explore:

uv run --with jupyter jupyter lab

Alternatively, install directly via pip:

pip install git+https://github.com/VIDA-NYU/AutoDDG@main

Caution

This installation method is temporary. A PyPI release of AutoDDG will soon be available. The git+https method will be deprecated in favor of the PyPI index.
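
After either installation route, a quick sanity check can confirm that the interpreter meets the Python >= 3.10 requirement and that the package is importable. The snippet below is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper written for this README, not part of AutoDDG, and it assumes the import name is `autoddg`, as in the usage example in this README.

```python
import importlib.util
import sys


def check_environment(package="autoddg", min_python=(3, 10)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Compare the running interpreter against the minimum supported version.
    if sys.version_info < min_python:
        required = ".".join(str(part) for part in min_python)
        problems.append(f"Python {required} or newer is required")
    # find_spec returns None when a top-level package cannot be located.
    if importlib.util.find_spec(package) is None:
        problems.append(f"Package '{package}' is not installed in this environment")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("Environment looks fine" if not issues else "\n".join(issues))
```

In a uv-managed checkout, run it through `uv run python` so the check uses the project environment rather than the system interpreter.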


Getting Started

The simplest way to use AutoDDG is to create an instance and generate a dataset description:

from openai import OpenAI
from autoddg import AutoDDG

# Setup OpenAI client
client = OpenAI(api_key="sk-...")

# Initialize AutoDDG
autoddg = AutoDDG(client=client, model_name="gpt-4o-mini")

# Generate description from a small CSV sample
sample_csv = """Case_ID,Age,BMI
C3L-00004,72,22.8
C3L-00010,30,34.15
"""

prompt, description = autoddg.generate_description(dataset_sample=sample_csv)

print(description)
# >>> This dataset contains medical information about patients, including their unique Case_ID, Age, and Body Mass Index (BMI). etc.
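
Real datasets are usually stored in files rather than inline strings. The sketch below shows one possible way to build a small CSV sample from a file and pass it to `generate_description`. The helpers `make_csv_sample` and `describe_csv` are hypothetical names introduced here for illustration and are not part of AutoDDG; the only AutoDDG call used is `generate_description(dataset_sample=...)`, which is assumed to return a `(prompt, description)` pair as in the example above.

```python
import csv
import io
from itertools import islice


def make_csv_sample(path, max_rows=5):
    """Return the header plus the first `max_rows` data rows of a CSV file as text."""
    with open(path, newline="", encoding="utf-8") as f:
        # Read one extra row so the header is kept alongside the data rows.
        rows = list(islice(csv.reader(f), max_rows + 1))
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def describe_csv(autoddg, path, max_rows=5):
    """Send a truncated sample of the file to AutoDDG and return only the description."""
    sample = make_csv_sample(path, max_rows)
    # Assumption: generate_description returns (prompt, description).
    _prompt, description = autoddg.generate_description(dataset_sample=sample)
    return description
```

With the `autoddg` instance created earlier, `describe_csv(autoddg, "patients.csv")` would return the description for a hypothetical file named `patients.csv`. Keeping the sample to a few rows limits the prompt size. To avoid hardcoding the key, note that `OpenAI()` called without `api_key` reads the `OPENAI_API_KEY` environment variable.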

Quick Jupyter Notebook Start

For a fuller introduction, start with the quick_start notebook, which walks through an example dataset.


How to Cite

If you use AutoDDG in your research, please cite our work:

@misc{2502.01050,
  Author = {Haoxiang Zhang and Yurong Liu and Wei-Lun Hung and Aécio Santos and Juliana Freire},
  Title = {AutoDDG: Automated Dataset Description Generation using Large Language Models},
  Year = {2025},
  Eprint = {arXiv:2502.01050},
}

License

AutoDDG is released under the Apache License 2.0.
