This is an evolving GitHub repository for the paper From Turn-Taking to Synchronous Dialogue: A Survey of Full-Duplex Spoken Language Models, currently under review at ICASSP 2026. In this paper, we survey the field of Full-Duplex Spoken Language Models (FD-SLMs), which enable synchronous human–AI dialogue through simultaneous speaking and listening, achieving a more realistic human-computer interaction experience.
This is a survey, but more than a survey: due to ICASSP's page limit, we omitted or abbreviated many technical details in the paper that are highly valuable for guiding a practical, production-ready implementation of a Full-Duplex Spoken Language Model. We will therefore continue to update the relevant content at this link.
If you find any mistakes, please don't hesitate to open an issue or contact [email protected] directly.
Note: A modular implementation does not necessarily imply plug-and-play compatibility with other SLMs. VITA-1.5 and Freeze-Omni, for example, are end-to-end models and can only be integrated as a whole.
In this section, we list all existing papers on full-duplex SLMs, covering both models and benchmarks.
We define non-independent models as prior or subsequent works by the same author team as an existing model, or as fine-tuned variants built upon existing full-duplex models.
Next, let us set aside the question of the Transformer itself, however it might be implemented, whether as a dual-tower architecture (dGSLM) or via token interleaving (NTPP). In an end-to-end implementation, we must still answer another fundamental question: what serves as the system's clock for perceiving the external world?
Some may ask: traditional SLMs don't incorporate clocks, yet they still function properly. In fact, it is not that they lack this ability, but that they rely on a subtler mechanism: the familiar turn-taking of conversation, in which the end of the user's turn itself acts as the timing signal that tells the model when to act. A full-duplex model has no such boundary to wait for, so it needs an explicit clock, as sketched below.
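To make the contrast concrete, here is a minimal sketch (our own illustration, not any specific model's code) of what an explicit clock looks like in an end-to-end full-duplex loop: a fixed-rate frame tick drives the model, which must listen and emit an output frame, possibly silence, on every tick, so the passage of time is encoded directly in the stream. The frame rate and the `EchoModel` / frame callbacks are hypothetical placeholders.

```python
import time

FRAME_SEC = 0.08  # e.g. 12.5 Hz frames; the exact rate is model-specific

class EchoModel:
    """Stand-in for a real FD-SLM: here it simply echoes the input frame."""
    def step(self, user_frame):
        return user_frame  # a real model would emit speech tokens or silence

def duplex_loop(model, read_frame, play_frame, num_ticks=10):
    """Fixed-rate loop: listen and emit on every tick, silence included."""
    next_tick = time.monotonic()
    for _ in range(num_ticks):
        out = model.step(read_frame())   # always listen...
        play_frame(out)                  # ...and always produce output
        next_tick += FRAME_SEC           # advance the clock; immune to drift
        time.sleep(max(0.0, next_tick - time.monotonic()))

# Toy usage with byte-string "frames" standing in for audio:
duplex_loop(EchoModel(), read_frame=lambda: b"\x00" * 1280,
            play_frame=lambda f: None)
```

Under this view, a turn-taking SLM is event-driven (it waits for end-of-turn), whereas a full-duplex SLM is clock-driven (it must decide something every frame).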
We have compiled as comprehensive a list as possible of all existing datasets available for full-duplex training and provided the methods for obtaining them; a channel-splitting sketch follows the table.
| Dataset | Lang | Scene | Access | License | Channels | Hours | Reference |
|---|---|---|---|---|---|---|---|
| AMI Meeting Corpus | EN | meeting | Free | CC BY 4.0 | 8 | ~100 | AMI (Univ. of Edinburgh) |
| ICSI Meeting Corpus | EN | meeting | Free | CC BY 4.0 | ~6 | ~70 | ICSI (Edinburgh portal) |
| ISL Meeting Speech Part 1 | EN | meeting | Paid | LDC EULA | 8 | ~10 | LDC2004S05 |
| LibriCSS | EN | meeting | Free | — | 7 | 10 | LibriCSS (GitHub) |
| Fisher English | EN | phone | Paid | LDC EULA | 2 | ~1,960 | LDC2004S13 / LDC2005S13 |
| SEAME (Mandarin–English CS) | EN+ZH | interview | Paid | LDC EULA | 2 | ~192 | LDC2015S04 |
| HKUST Mandarin Telephone | ZH | phone | Paid | LDC EULA | 2 | ~149 | LDC2005S15 |
| NIST Meeting Pilot | EN | meeting | Paid | LDC EULA | ~16 | ~15 | LDC2004S09 |
| CHiME‑6 | EN | dinner‑party | Free | CC BY‑SA 4.0 | 16 | 50+ | OpenSLR SLR150 |
| DiPCo (Dinner Party Corpus) | EN | dinner‑party | Free | CDLA‑Permissive‑1.0 | 35 | ~5 | Zenodo DOI |
| AliMeeting (M2MeT) | ZH | meeting | Free | CC BY‑SA 4.0 | 8 | 118.75 | OpenSLR SLR119 |
| AISHELL‑4 | ZH | meeting | Free | — | 8 | ~120 | OpenSLR SLR111 |
| MISP‑Meeting | ZH | meeting | Application | — | 8 | 125 | MISP 2025 Data |
| AISHELL‑5 | ZH | in‑car | Free | CC BY‑SA 4.0 | 8 | 100+ | OpenSLR SLR159 |
| Switchboard‑1 Release 2 | EN | phone | Paid | LDC EULA | 2 | ~260 | LDC97S62 |
| Fisher Spanish Speech | ES | phone | Paid | LDC EULA | 2 | ~163 | LDC2010S01 |
| Fisher Levantine Arabic CTS | AR | phone | Paid | LDC EULA | 2 | ~45 | LDC2007S02 |
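The two-channel telephone corpora above (Fisher, Switchboard, HKUST) store each speaker on a separate channel, which maps directly onto full-duplex training pairs. Below is a minimal sketch of splitting such a recording, assuming it has already been converted from LDC's SPH format to WAV; the file name is hypothetical.

```python
# pip install soundfile
import soundfile as sf

def split_duplex_channels(wav_path: str):
    """Return (speaker_a, speaker_b, sample_rate) from a 2-channel recording."""
    audio, sr = sf.read(wav_path, always_2d=True)  # shape: (num_samples, channels)
    assert audio.shape[1] == 2, "expected one speaker per channel"
    return audio[:, 0], audio[:, 1], sr

# Usage (hypothetical file name):
# a, b, sr = split_duplex_channels("fisher_call_00001.wav")
```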
Building on FD-Bench and Full-Duplex-Bench (v1.5), especially the latter, for which we extend special thanks to Professor Hung-yi Lee, we have developed an even more convenient benchmark based on the engineering details of the ICASSP HumDial Challenge. Our goal is to make evaluating your model as close to one-click as possible and ultimately to provide a quantifiable score. We name this benchmark Badcat. For details, please refer to Badcat-Benchmark/README.md.
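As a toy illustration of the kind of quantifiable score such benchmarks report (not Badcat's actual implementation), consider mean response latency: the gap between the end of the user's utterance and the onset of the model's reply, averaged over a test set. Timestamps are assumed to be in seconds.

```python
def mean_response_latency(turns):
    """turns: list of (user_speech_end, model_speech_start) timestamp pairs."""
    gaps = [start - end for end, start in turns]
    return sum(gaps) / len(gaps) if gaps else float("nan")

# Example: three exchanges with 200 ms, 350 ms and 500 ms latencies.
print(mean_response_latency([(1.0, 1.2), (4.0, 4.35), (9.0, 9.5)]))  # 0.35
```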
We believe that in the age of AI, it is more important than ever to honor the foundational work of our predecessors—whose ideas can be revitalized and find new life in the era of large language models, much like how LSTM once revolutionized NLP. In the pre-LLM era, the Spoken Dialogue Systems (SDS) community had long been exploring full-duplex interaction. Therefore, we list a selection of representative works from this line of research and provide brief summaries. We encourage readers to consult the original papers to fully grasp the authors’ ideas.
If you find our survey useful for your research, please 📚cite📚 the following paper:
@article{chen2025FD-SLMs,
  title={From Turn-Taking to Synchronous Dialogue: A Survey of Full-Duplex Spoken Language Models},
  author={Chen, Yuxuan and Yu, Haoyuan},
  journal={arXiv preprint arXiv:2509.14515},
  year={2025}
}
Update (October 31)
This repository will now receive regular updates to maintain the most current survey content.
I recently participated in the ICASSP HumDial Challenge, which temporarily delayed updates to this repository. However, this experience provided valuable insights into full-duplex implementation, particularly regarding modular approaches.
Additionally, several recent full-duplex papers have been published, such as FLM-Audio [arXiv:2509.02521]; these will be incorporated into upcoming updates.
