AutoDDG

Automated Dataset Description Generation using Large Language Models

submitted to VLDB 2025

Installation

Clone the repository and install dependencies via uv (recommended):

git clone https://github.com/VIDA-NYU/AutoDDG.git
cd AutoDDG
uv sync
# If you do not have uv installed:
# * `curl -LsSf https://astral.sh/uv/install.sh | sh`
# * or look at https://docs.astral.sh/uv/getting-started/installation/

Then launch Jupyter Lab to explore:

uv run --with jupyter jupyter lab

Alternatively, install directly via pip:

pip install git+https://github.com/VIDA-NYU/AutoDDG@main

Caution

This installation method is temporary. A PyPI release of AutoDDG will soon be available. The git+https method will be deprecated in favor of the PyPI index.
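Whichever installation route is used, it can help to confirm that the package is importable before opening a notebook. The following is a minimal sketch using only the Python standard library; the helper name `is_installed` is hypothetical and not part of AutoDDG, and the module name `autoddg` is taken from the import statement in the Getting Started example.

```python
import importlib.util


def is_installed(module_name="autoddg"):
    """Return True if the named top-level module can be imported here."""
    # find_spec returns None when a top-level module cannot be located.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("autoddg importable:", is_installed())
```

If this reports False inside a uv-managed checkout, the script was probably run outside the project environment.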

Getting Started

The simplest way to use AutoDDG is to create an instance and generate a dataset description from a small CSV sample:

from openai import OpenAI
from autoddg import AutoDDG

# Setup OpenAI client
client = OpenAI(api_key="sk-...")

# Initialize AutoDDG
autoddg = AutoDDG(client=client, model_name="gpt-4o-mini")

# Generate description from a small CSV sample
sample_csv = """Case_ID,Age,BMI
C3L-00004,72,22.8
C3L-00010,30,34.15
"""

prompt, description = autoddg.generate_description(dataset_sample=sample_csv)

print(description)
# Example output (exact wording varies between runs and models):
# This dataset contains medical information about patients, including their unique Case_ID, Age, and Body Mass Index (BMI). ...
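In practice the sample usually comes from a file on disk rather than an inline string, and the API key is better kept out of source code. The sketch below shows one way to do both with the standard library alone. The helpers `csv_sample` and `api_key_from_env` and the file name `patients.csv` are hypothetical and not part of AutoDDG; only the `AutoDDG(...)` and `generate_description(dataset_sample=...)` calls shown in the trailing comments come from the example above.

```python
import csv
import io
import os
from itertools import islice


def csv_sample(path, n_rows=5):
    """Return the header plus the first n_rows data rows of a CSV file as text."""
    with open(path, newline="", encoding="utf-8") as f:
        # n_rows + 1 because the first row read is the header.
        rows = list(islice(csv.reader(f), n_rows + 1))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


def api_key_from_env(var="OPENAI_API_KEY"):
    """Read the API key from an environment variable instead of hard-coding it."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable first")
    return key


# Usage with the objects from the example above (assumes a local patients.csv):
# client = OpenAI(api_key=api_key_from_env())
# autoddg = AutoDDG(client=client, model_name="gpt-4o-mini")
# prompt, description = autoddg.generate_description(
#     dataset_sample=csv_sample("patients.csv", n_rows=5)
# )
```

Sending only the first few rows keeps the prompt short, though a head-of-file sample may not be representative of the whole dataset.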

Quick Jupyter Notebook Start

For a fuller introduction, we recommend starting with the quick_start notebook, which uses an example dataset.

How to Cite

If you use AutoDDG in your research, please cite our work:

@misc{2502.01050,
Author = {Haoxiang Zhang and Yurong Liu and Wei-Lun Hung and Aécio Santos and Juliana Freire},
Title = {AutoDDG: Automated Dataset Description Generation using Large Language Models},
Year = {2025},
Eprint = {arXiv:2502.01050},
}

License

AutoDDG is released under the Apache License 2.0.
