This is the official data and code repository for the paper Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol.
The full benchmark is available in the `data` directory. Full text of the papers can be downloaded at this link.
Each data entry looks like:
```json
{
    "table_id": "...",
    "table_query_id": ...,
    "table_caption": "...",
    "user_demand": "...",
    "table": {
        "MR": {...},
        "NLG": {...}
    },
    "input_references_id": [...],
    "input_references_label": [...],
    "table_references": [...],
    "sampled_noisy_paper": [...]
}
```
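A minimal sketch of loading and inspecting an entry. The path `data/benchmark.json` is a placeholder for the actual file in the `data` directory, and the JSON-list layout is an assumption:

```python
import json

import pandas as pd

# Placeholder path: replace with the actual file in the data directory.
with open("data/benchmark.json") as f:
    entries = json.load(f)  # assumes a JSON list of entries; adjust if the data is stored as JSONL

entry = entries[0]
print(entry["table_id"], entry["user_demand"])

# "table" is a column-oriented dict ({"MR": {...}, "NLG": {...}}), so it can be
# loaded directly into a pandas DataFrame, as noted in the field list below.
table_df = pd.DataFrame(entry["table"])
print(table_df)
```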
Field descriptions:
- `table_id`: A unique identifier for this specific table instance.
- `table_query_id`: An integer representing the ID of the user query or information need. Multiple entries can share the same query ID if they stem from the same user instruction.
- `table_caption`: A human-readable caption that describes the table content or theme, mimicking captions typically found in research papers.
- `user_demand`: The detailed user intent in natural language. It specifies what the table should demonstrate or convey.
- `table`: The core content of the benchmark entry; can be loaded via `pandas.DataFrame`.
- `input_references_id`: A list of all paper reference IDs retrieved or considered for the table construction.
- `input_references_label`: A list of binary labels (1 for relevant, 0 for irrelevant) corresponding to `input_references_id`, indicating whether each paper was selected for table inclusion.
- `table_references`: The subset of `input_references_id` that was included in the final table (i.e., where the label is 1).
- `sampled_noisy_paper`: A sample of papers labeled as irrelevant (0), included to simulate distractor papers. Useful for evaluating the robustness of LLM-based table generation against noisy retrieval results.
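For robustness experiments, the distractor papers can be mixed with the relevant references to form a noisy input set. A minimal sketch, reusing `entry` from the snippet above and assuming `sampled_noisy_paper` holds paper IDs:

```python
import random

# Relevant papers are those with label 1; per the field descriptions, this
# should coincide with entry["table_references"].
relevant_ids = [
    ref_id
    for ref_id, label in zip(entry["input_references_id"], entry["input_references_label"])
    if label == 1
]

# Mix in the pre-sampled distractors (assumed here to be paper IDs) to
# simulate noisy retrieval results.
noisy_input_ids = relevant_ids + list(entry["sampled_noisy_paper"])
random.shuffle(noisy_input_ids)
```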
Code for constructing the benchmark is available in the `benchmark_construction` directory.
To reproduce the benchmark:
- Download the arxivDIGESTables dataset, which is available at this link.
- Collect user demands for generating the tables from the table captions via `user_intention_rewriting.py`.
- Run `distractor_paper_embedding.py` to generate embeddings for the papers in the dataset.
- Run `distractor_paper_candidate_selection.py` to select distractor paper candidates (see the sketch after this list).
- Have human experts annotate the candidates for relevance, then merge the labels into the benchmark.
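For intuition, here is a minimal sketch of the kind of embedding-based distractor selection these steps perform, not the repository's exact scripts: candidate papers are ranked by embedding similarity to the user demand, and the closest ones become distractor candidates. The model name and the paper schema (`id`, `abstract`) are illustrative assumptions.

```python
# Sketch of embedding-based distractor candidate selection (not the repo's exact script).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def select_distractor_candidates(user_demand, candidate_papers, k=10):
    """Rank candidate papers by similarity to the user demand and keep the top-k.

    candidate_papers: list of dicts with "id" and "abstract" keys (assumed schema).
    """
    query_emb = model.encode(user_demand, normalize_embeddings=True)
    paper_embs = model.encode([p["abstract"] for p in candidate_papers], normalize_embeddings=True)
    scores = paper_embs @ query_emb  # cosine similarity, since embeddings are normalized
    top_idx = np.argsort(-scores)[:k]
    return [candidate_papers[i]["id"] for i in top_idx]
```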
Code for parsing papers from arXiv and the ACL Anthology is also provided, as `parse_arxiv_paper.py` and `parse_acl_paper.py`, respectively.
We also release the code to reproduce all baselines and our proposed method. The code is organized into two main directories:
- `baseline`: Code for reproducing the baselines.
- `method`: Code for our proposed iterative batch-based method (an illustrative sketch follows this list).
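The following is only an illustrative sketch of what an iterative, batch-based generation loop could look like; it is not the implementation in `method`, which should be consulted for the actual algorithm. `ask_llm` is a placeholder for any LLM client, and the paper schema is assumed.

```python
# Illustrative sketch of iterative, batch-based table generation (NOT the actual
# implementation in method/). ask_llm is a placeholder for any LLM client.
def generate_table_iteratively(user_demand, papers, ask_llm, batch_size=5):
    """Build a table by feeding papers to the LLM in batches and updating the table each round."""
    table_markdown = ""  # running table, kept in markdown form for simplicity
    for start in range(0, len(papers), batch_size):
        batch = papers[start:start + batch_size]
        batch_text = "\n\n".join(p["abstract"] for p in batch)  # assumed paper schema
        prompt = (
            f"User demand: {user_demand}\n\n"
            f"Current table (may be empty):\n{table_markdown}\n\n"
            f"New papers:\n{batch_text}\n\n"
            "Update the table so it also covers the new papers. "
            "Ignore papers that are irrelevant to the user demand. Return only the table in markdown."
        )
        table_markdown = ask_llm(prompt)
    return table_markdown
```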
To evaluate the generated tables, we provide a script that synthesizes QA pairs from one table and asks LLMs to answer them based on the other table. The evaluation script is located in the `evaluation` directory.
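For intuition only, here is a minimal sketch of that QA-based evaluation idea, not the script in `evaluation`: cells of one table are turned into questions, an LLM answers them using only the other table, and the answers are scored. `ask_llm` is a placeholder, and the naive string-match scoring is an assumption.

```python
# Sketch of QA-based table evaluation; ask_llm is a placeholder for an LLM client.
import pandas as pd

def synthesize_qa_pairs(table: pd.DataFrame):
    """Turn each cell of the source table into a (question, gold answer) pair."""
    qa_pairs = []
    for paper in table.index:
        for column in table.columns:
            question = f"What is the '{column}' of the paper '{paper}'?"
            qa_pairs.append((question, str(table.loc[paper, column])))
    return qa_pairs

def evaluate(reference_table: pd.DataFrame, generated_table: pd.DataFrame, ask_llm):
    """Answer questions derived from the reference table using only the generated table."""
    qa_pairs = synthesize_qa_pairs(reference_table)
    correct = 0
    for question, gold in qa_pairs:
        prompt = (
            "Answer the question using only the table below.\n\n"
            f"{generated_table.to_markdown()}\n\nQuestion: {question}"
        )
        prediction = ask_llm(prompt)
        correct += int(gold.lower() in prediction.lower())  # naive string match; the paper's metric may differ
    return correct / len(qa_pairs)
```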
If you wish to use our code or data, please cite our paper:
```bibtex
@misc{wang2025llmsgeneratetabularsummaries,
  title={Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol},
  author={Weiqi Wang and Jiefu Ou and Yangqiu Song and Benjamin Van Durme and Daniel Khashabi},
  year={2025},
  eprint={2504.10284},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.10284},
}
```