TuRTLe is a framework for systematically assessing LLMs across key RTL generation tasks. It integrates multiple existing benchmarks and automates the evaluation process, enabling a comprehensive assessment of LLM performance in syntax correctness, functional correctness, synthesis, PPA optimization, and exact line completion.
This work extends the functionality and flexibility of bigcode-evaluation-harness with open-source EDA tools to run Specification-to-RTL and RTL Code Completion benchmarks. It also draws inspiration from vllm-code-harness to enable efficient inference with vLLM.
Benchmarks implemented so far are:
- VerilogEval v2.0: Specification-to-RTL and Module Completion
- RTLLM v1.1 and v2.0: Specification-to-RTL
- VGen: Module Completion
- RTL-Repo: Single Line Completion
Open-source EDA tools integrated:
- Icarus Verilog: syntax and functionality
- Verilator: syntax and functionality
- Yosys: synthesis
- OpenROAD: PPA
- OpenLane: flow that integrates Yosys and OpenROAD
For more details about our work, refer to our arXiv paper. Below is a diagram of the high-level structure of the framework:
- [2025-07-03] TuRTLe now supports Verilator as a simulator for syntax and functionality checks
- [2025-06-12] We added support for multi-node inference with Ray, along with configurations for larger models
- [2025-05-19] The project’s source code is now publicly released. We’d love to hear your feedback, so give it a try!
- [2025-03-31] Our paper "TuRTLe: A Unified Evaluation of LLMs for RTL Generation" is now available on arXiv!
- [2025-03-20] The leaderboard is now live! Check it out on our Hugging Face Space
- [In progress] Release repo compatible with local execution
Check the TuRTLe Leaderboard to see the best-performing open-source models for each task.
Warning
Dependencies Notice
vLLM currently supports up to Python 3.12. Ensure that your Python version does not exceed this limit to avoid compatibility issues.
Most of the models need to be executed in HPC environments. For this reason, TuRTLe currently relies on Slurm and Singularity for execution.
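You can quickly confirm which interpreter version you are using with:
python3 --version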
- Clone the repository:
  git clone --recursive https://github.com/HPAI-BSC/TuRTLe.git
- (Optional) Create and activate a virtual environment:
  python3 -m venv venv
  source venv/bin/activate
- Install Python dependencies:
  pip install -r requirements.txt
  On non-Linux devices the above command will raise:
  AssertionError: vLLM only supports Linux platform (including WSL).
  In this case, vLLM has to be installed from source (see the vLLM installation page for details).
- Install bigcode-evaluation-harness as a Python package:
  cd TuRTLe/bigcode-evaluation-harness/
  pip install -e .
- Install the EDA tools (not required for single-line completion benchmarks):
  To install OpenLane, follow the instructions provided in the OpenLane Installation Guide.
  To install Icarus Verilog on Windows, check the Icarus Verilog Windows download page. To install it on Linux, execute:
  sudo apt-get update
  sudo apt-get install iverilog
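Verilator, used for the syntax and functionality checks, can typically be installed from your distribution's package manager as well (see the Verilator documentation for other platforms). On Debian/Ubuntu, for example:
sudo apt-get install verilator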
Finally, we recommend using Singularity for containerization in HPC environments. TuRTLe can dynamically create and submit Slurm job scripts. To enable this, include the following settings in your benchmark configuration file (a sketch is shown below the list):
- `singularity_image`: path to your Singularity image.
- For each model, specify a `slurm_config` from `turtle/configs/slurm.yml` with the Slurm directives to run the benchmark.
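As a rough illustration, a benchmark configuration could look like the sketch below. Apart from `singularity_image`, `slurm_config`, and `metric_output_path`, which are mentioned in this README, the field names and values are hypothetical; check the existing files in `turtle/configs/` for the exact schema.

```yaml
# Sketch only: field names other than singularity_image, slurm_config and
# metric_output_path are hypothetical; see turtle/configs/ for the real schema.
singularity_image: /path/to/turtle.sif    # Singularity image used to run the Slurm job
metric_output_path: results/my_benchmark  # where generations and metrics are stored
models:                                   # hypothetical layout of the per-model entries
  - name: Qwen2.5-32B                     # model to evaluate
    slurm_config: default                 # entry in turtle/configs/slurm.yml with the Slurm directives
```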
To execute the project, use the `turtle/run.py` script with the appropriate arguments. Below are the details of the available parameters:
python turtle/run.py [--benchmark <config_file>] [--model <model_name>] [--run_all]
If the configuration file includes both `singularity_image` and `slurm_config`, TuRTLe will automatically generate and execute a Slurm script to run the benchmark using the specified Singularity image.
- `--benchmark`: Name of the .yml file in `turtle/configs/` with the configurations of the benchmark to run (e.g., `rtlrepo`, `rtllm`, `verilog_eval_cc`, `verilog_eval_rtl`, `verigen`).
- `--model`: Specify a particular model to run. If not provided, all models in the configuration file will be executed.
- `--run_all`: Use this flag to run all benchmarks against all models.
Because of the dual-image setup, with one image for inference and another that includes the EDA tools (e.g., Icarus Verilog, Verilator, Yosys, OpenLane), you can control each phase of the pipeline separately (see the example below):
- `--generation_only`: Use this flag to only perform inference.
- `--evaluation_only`: Use this flag to only perform evaluation. The generations are loaded automatically from the `metric_output_path` variable in the YAML.
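For instance, inference and evaluation can be launched as two separate jobs (the benchmark and model below are just the ones used in the examples that follow):
python turtle/run.py --benchmark verilog_eval_cc --model Qwen2.5-32B --generation_only
python turtle/run.py --benchmark verilog_eval_cc --model Qwen2.5-32B --evaluation_only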
- Run all models specified in the configuration file for the RTL-Repo benchmark:
  python turtle/run.py --benchmark rtlrepo
- Test Qwen2.5-32B against the VerilogEval Code Completion benchmark:
  python turtle/run.py --benchmark verilog_eval_cc --model Qwen2.5-32B
- Run all benchmarks against all models:
  python turtle/run.py --run_all
The process to implement a benchmark is very similar to the one described in the bigcode-evaluation-harness guide. Follow these steps:
- Copy `turtle/tasks/template/new_task.py` into `turtle/tasks/` and rename it after your benchmark, `<benchmark_name>.py`.
- Complete all the TODO comments in the template file (a minimal skeleton is sketched below this list).
- Define a configuration file named `turtle/configs/<benchmark_name>.yml` and list the models you want to evaluate along with their required parameters.
- Update the `_load_new_modules()` and `_create_extended_registry()` methods within `turtle/src/utils/task_updater.py`.
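For orientation, here is a minimal, hypothetical skeleton of what a completed task file might look like. It assumes the bigcode-evaluation-harness `Task` interface (`bigcode_eval.base`); the authoritative method names and TODOs are those in `turtle/tasks/template/new_task.py`, and the class name, dataset path, and stop words below are made up for illustration.

```python
# Minimal sketch, not the actual TuRTLe template: it assumes the
# bigcode-evaluation-harness Task interface. Follow the TODOs in
# turtle/tasks/template/new_task.py for the real structure.
from bigcode_eval.base import Task  # assumed base class from bigcode-evaluation-harness


class MyRTLBenchmark(Task):  # hypothetical benchmark name
    DATASET_PATH = "org/my-rtl-dataset"  # hypothetical dataset identifier

    def __init__(self):
        # Stop words and execution requirement depend on your benchmark.
        super().__init__(stop_words=["endmodule"], requires_execution=True)

    def get_dataset(self):
        # TODO: return the evaluation split of the benchmark dataset.
        raise NotImplementedError

    def get_prompt(self, doc):
        # TODO: build the Specification-to-RTL or completion prompt for one sample.
        raise NotImplementedError

    def get_reference(self, doc):
        # TODO: return the reference solution or testbench for one sample.
        raise NotImplementedError

    def postprocess_generation(self, generation, idx):
        # TODO: strip the prompt and extract the generated RTL.
        raise NotImplementedError

    def process_results(self, generations, references):
        # TODO: run the EDA tools and compute the benchmark metrics (e.g., pass@k, PPA).
        raise NotImplementedError
```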
@inproceedings{garciagasulla2025turtleunifiedevaluationllms,
title={TuRTLe: A Unified Evaluation of LLMs for RTL Generation},
author={Dario Garcia-Gasulla and Gokcen Kestor and Emanuele Parisi and Miquel Albert\'i-Binimelis and Cristian Gutierrez and Razine Moundir Ghorab and Orlando Montenegro and Bernat Homs and Miquel Moreto},
booktitle = {Proceedings of the 2025 ACM/IEEE International Symposium on Machine Learning for CAD},
series = {MLCAD '25},
year={2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
location = {Santa Cruz, CA, USA},
url={https://arxiv.org/abs/2504.01986},
}
Any contribution is more than welcome! If you've found a bug or have an idea for an improvement, don't hesitate to open a new issue using our issue forms. We also encourage pull requests that add new benchmarks for any task relevant to chip design.
If you have any questions or feedback, feel free to email us at [email protected]. You can also support the project by following or starring the repository.
Made with ❤️ by HPAI at the Barcelona Supercomputing Center (BSC)