TuRTLe is a framework for systematically assessing LLMs across key RTL generation tasks. It integrates multiple existing benchmarks and automates the evaluation process, enabling a comprehensive assessment of LLM performance in syntax correctness, functional correctness, synthesis, PPA optimization, and exact line completion.
This work extends the functionality and flexibility of bigcode-evaluation-harness with open-source EDA tools to run Specification-to-RTL and RTL Code Completion benchmarks. It also draws inspiration from vllm-code-harness to enable efficient inference with vLLM.
Benchmarks implemented so far are:
- VerilogEval v2.0: Specification-to-RTL and Module Completion
- RTLLM v1.1 and v2.0: Specification-to-RTL
- VGen: Module Completion
- RTL-Repo: Single Line Completion
Open-source EDA tools integrated:
- Icarus Verilog: syntax and functionality
- Verilator: syntax and functionality
- Yosys: synthesis
- OpenROAD: PPA
- OpenLane: flow that integrates Yosys and OpenROAD
For more details about our work, refer to our arXiv paper. Below is a diagram of the high-level structure of the framework:
- [2025-07-03] TuRTLe now supports Verilator as a simulator for syntax and functionality checks
- [2025-06-12] We added support for multi-node inference with Ray, along with configurations for larger models
- [2025-05-19] The project’s source code is now publicly released. We’d love to hear your feedback, so give it a try!
- [2025-03-31] Our paper "TuRTLe: A Unified Evaluation of LLMs for RTL Generation" is now available on arXiv!
- [2025-03-20] The leaderboard is now live! Check it out on our Hugging Face Space
- [In progress] Release repo compatible with local execution
Check the TuRTLe Leaderboard to see the best-performing open-source models for each task.
Warning
Dependencies Notice
vLLM currently supports up to Python 3.12. Ensure that your Python version does not exceed this limit to avoid compatibility issues.
Most of the models need to be executed in HPC environments. For this reason, TuRTLe currently relies on Slurm and Singularity for execution.
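You can quickly confirm which interpreter version you are using with:
python3 --version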
- Clone the repository:
  git clone --recursive https://github.com/HPAI-BSC/TuRTLe.git
- (Optional) Create and activate a virtual environment:
  python3 -m venv venv
  source venv/bin/activate
- Install Python dependencies:
  pip install -r requirements.txt
  On non-Linux devices the above command will raise:
  AssertionError: vLLM only supports Linux platform (including WSL).
  In this case, vLLM has to be installed from source (see the vLLM installation page for details).
- Install bigcode-evaluation-harness as a Python package:
  cd TuRTLe/bigcode-evaluation-harness/
  pip install -e .
- Install the EDA tools (not required for single-line completion benchmarks):
  To install OpenLane, follow the instructions provided in the OpenLane Installation Guide.
  To install Icarus Verilog on Windows, check the Icarus Verilog Windows download page. To install it on Linux, execute:
  sudo apt-get update
  sudo apt-get install iverilog
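Verilator, used for the syntax and functionality checks, can typically be installed from your distribution's package manager as well (see the Verilator documentation for other platforms). On Debian/Ubuntu, for example:
sudo apt-get install verilator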
Finally, we recommend using Singularity for containerization in HPC environments. TuRTLe can dynamically create and submit Slurm job scripts. To enable this, include the following settings in your benchmark configuration file (a sketch is shown below the list):
- `singularity_image`: path to your Singularity image.
- For each model, specify a `slurm_config` from `turtle/configs/slurm.yml` with the Slurm directives to run the benchmark.
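As a rough illustration, a benchmark configuration could look like the sketch below. Apart from `singularity_image`, `slurm_config`, and `metric_output_path`, which are mentioned in this README, the field names and values are hypothetical; check the existing files in `turtle/configs/` for the exact schema.

```yaml
# Sketch only: field names other than singularity_image, slurm_config and
# metric_output_path are hypothetical; see turtle/configs/ for the real schema.
singularity_image: /path/to/turtle.sif    # Singularity image used to run the Slurm job
metric_output_path: results/my_benchmark  # where generations and metrics are stored
models:                                   # hypothetical layout of the per-model entries
  - name: Qwen2.5-32B                     # model to evaluate
    slurm_config: default                 # entry in turtle/configs/slurm.yml with the Slurm directives
```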
To execute the project, use the `turtle/run.py` script with the appropriate arguments. Below are the details of the available parameters:
python turtle/run.py [--benchmark <config_file>] [--model <model_name>] [--run_all]
If the configuration file includes both `singularity_image` and `slurm_config`, TuRTLe will automatically generate and execute a Slurm script to run the benchmark using the specified Singularity image.
- `--benchmark`: Name of the .yml file in `turtle/configs/` with the configurations of the benchmark to run (e.g., `rtlrepo`, `rtllm`, `verilog_eval_cc`, `verilog_eval_rtl`, `verigen`).
- `--model`: Specify a particular model to run. If not provided, all models in the configuration file will be executed.
- `--run_all`: Use this flag to run all benchmarks against all models.
Because of the dual-image setup, with one image for inference and another that includes the EDA tools (e.g., Icarus Verilog, Verilator, Yosys, OpenLane), you can control each phase of the pipeline separately (see the example below):
- `--generation_only`: Use this flag to only perform inference.
- `--evaluation_only`: Use this flag to only perform evaluation. The generations are loaded automatically from the `metric_output_path` variable in the YAML.
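For instance, inference and evaluation can be launched as two separate jobs (the benchmark and model below are just the ones used in the examples that follow):
python turtle/run.py --benchmark verilog_eval_cc --model Qwen2.5-32B --generation_only
python turtle/run.py --benchmark verilog_eval_cc --model Qwen2.5-32B --evaluation_only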
- Run all models specified in the configuration file for the RTL-Repo benchmark:
  python turtle/run.py --benchmark rtlrepo
- Test Qwen2.5-32B against the VerilogEval Code Completion benchmark:
  python turtle/run.py --benchmark verilog_eval_cc --model Qwen2.5-32B
- Run all benchmarks against all models:
  python turtle/run.py --run_all
The process to implement a benchmark is very similar to the one described in the bigcode-evaluation-harness guide. Follow these steps:
- Copy `turtle/tasks/template/new_task.py` into `turtle/tasks/` and rename it after your benchmark, `<benchmark_name>.py`.
- Complete all the TODO comments in the template file (a minimal skeleton is sketched below this list).
- Define a configuration file named `turtle/configs/<benchmark_name>.yml` and list the models you want to evaluate along with their required parameters.
- Update the `_load_new_modules()` and `_create_extended_registry()` methods within `turtle/src/utils/task_updater.py`.
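For orientation, here is a minimal, hypothetical skeleton of what a completed task file might look like. It assumes the bigcode-evaluation-harness `Task` interface (`bigcode_eval.base`); the authoritative method names and TODOs are those in `turtle/tasks/template/new_task.py`, and the class name, dataset path, and stop words below are made up for illustration.

```python
# Minimal sketch, not the actual TuRTLe template: it assumes the
# bigcode-evaluation-harness Task interface. Follow the TODOs in
# turtle/tasks/template/new_task.py for the real structure.
from bigcode_eval.base import Task  # assumed base class from bigcode-evaluation-harness


class MyRTLBenchmark(Task):  # hypothetical benchmark name
    DATASET_PATH = "org/my-rtl-dataset"  # hypothetical dataset identifier

    def __init__(self):
        # Stop words and execution requirement depend on your benchmark.
        super().__init__(stop_words=["endmodule"], requires_execution=True)

    def get_dataset(self):
        # TODO: return the evaluation split of the benchmark dataset.
        raise NotImplementedError

    def get_prompt(self, doc):
        # TODO: build the Specification-to-RTL or completion prompt for one sample.
        raise NotImplementedError

    def get_reference(self, doc):
        # TODO: return the reference solution or testbench for one sample.
        raise NotImplementedError

    def postprocess_generation(self, generation, idx):
        # TODO: strip the prompt and extract the generated RTL.
        raise NotImplementedError

    def process_results(self, generations, references):
        # TODO: run the EDA tools and compute the benchmark metrics (e.g., pass@k, PPA).
        raise NotImplementedError
```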
@inproceedings{garciagasulla2025turtleunifiedevaluationllms,
title={TuRTLe: A Unified Evaluation of LLMs for RTL Generation},
author={Dario Garcia-Gasulla and Gokcen Kestor and Emanuele Parisi and Miquel Albert\'i-Binimelis and Cristian Gutierrez and Razine Moundir Ghorab and Orlando Montenegro and Bernat Homs and Miquel Moreto},
booktitle = {Proceedings of the 2025 ACM/IEEE International Symposium on Machine Learning for CAD},
series = {MLCAD '25},
year={2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
location = {Santa Cruz, CA, USA},
url={https://arxiv.org/abs/2504.01986},
}
Any contribution is more than welcome! If you've found a bug or have an idea for an improvement, don't hesitate to open a new issue using our issue forms. We also encourage pull requests that add new benchmarks for any task relevant to chip design.
If you have any questions or feedback, feel free to email us at [email protected]. You can also support the project by following or starring the repository.
Made with ❤️ by HPAI at the Barcelona Supercomputing Center (BSC)