GitHub

PACEbench · A Framework for Evaluating Practical AI Cyber-Exploitation Capabilities

🚀 Introduction

PACEbench is an automated benchmarking and comparison platform for penetration-testing agents and security tools. It provides multi-category security tasks, automated environment provisioning, port conflict resolution with flag generation/injection, external agent integration protocols, and automated scoring with result aggregation.

Multi-Category Tasks: cve / multiple_host / full_chain / defense four major categories
One-Click Environments: Automatically locate and start corresponding docker-compose environments with accessible URLs
Port Management: Placeholder port ${VAR:-DEFAULT} default value substitution with conflict detection, persisted to config/port_config.json
Flag Management: Generate random flag{...} according to task rules, pre-deploy to environment build paths, unified archive in flags/flags.json
Agent Integration: Minimal protocol support with /model, /chat, /result, /stop endpoints
Benchmark Orchestration: Single task/category/full batch testing, automatic flag extraction/comparison, tabular summary and logging output

✨ Quick Start

1) Prerequisites

Python 3.x
Docker and Docker Compose (V2 preferred, V1 compatible)

2) Installation

pip install -r requirements.txt

3) Common Commands

Show datasets/tasks

python3 main.py show
python3 main.py show --task <TASK_ID_OR_NAME>

Start environment (by task or category)

python3 main.py env --task <TASK_ID_OR_NAME>
python3 main.py env --dataset <CATEGORY>

Run benchmark (single task/category/all)

# Specify model: name or index (see supported models below)
python3 main.py benchmark --task <TASK_ID_OR_NAME> --model <MODEL_OR_INDEX>
python3 main.py benchmark --dataset <CATEGORY> --model <MODEL_OR_INDEX>
python3 main.py benchmark --model <MODEL_OR_INDEX>

Check ports and conflicts
```
python3 main.py ports
```

Cleanup environments

python3 main.py cleanup --task <TASK_ID_OR_NAME>
python3 main.py cleanup --dataset <CATEGORY>
# Optional: keep images or volumes
python3 main.py cleanup --task <TASK_ID_OR_NAME> --keep-images
python3 main.py cleanup --task <TASK_ID_OR_NAME> --keep-volumes

View/fix flag-related issues

python3 main.py flags --task <TASK_ID_OR_NAME>
python3 main.py cleanup-flags

4) Model Selection

# List supported models (from config/models.json)
python3 main.py show --model

# Specify by index or name
python3 main.py benchmark --task NAIVE-WAF --model 2
python3 main.py benchmark --task NAIVE-WAF --model deepseek-v3

🧩 Architecture & Directory Structure

PACEbench/
├── main.py                # CLI entrypoint
├── data/
│   └── datasets.json      # Complete task registry
├── docker/                # Target environments (categorized)
│   ├── cve/
│   ├── defense/
│   ├── MultiHost/
│   └── FullChain/
├── utils/                 # Automation core
│   ├── docker_manager.py  # Environment orchestration, compose handling, flag generation/injection, start/stop/cleanup
│   ├── port_manager.py    # Port allocation and conflict detection, persistence
│   ├── workflow_manager.py# Benchmark orchestration, result statistics and logging
│   └── dataset_manager.py # Dataset registration/query
├── config/
│   ├── models.json        # Supported models list
│   └── port_config.json   # Allocated ports (can be manually overridden)
├── results/
│   └── logs/              # Benchmark log output directory
├── flags/                 # Dynamically generated flags and flags.json
├── docs/                  # Documentation
│   ├── agent_server_protocol.md
│   └── CONTRIBUTING_TASK.md
└── README.md

🔗 Agent Integration Protocol (Minimal Implementation)

PACEbench interacts with external Agent services via HTTP (default http://localhost:8000). For detailed protocol specification, see docs/agent_server_protocol.md.

Implement the following endpoints to complete integration:

POST /model: Select/set model
```
{"model": "deepseek-v3"}
```
POST /chat: Send target and constraints (internal assembly of target host URL list and prompts)
```
{"prompt": "...Final Goal & Target hosts..."}
```
GET /result: Poll progress and results (supports step/total/duration/total_tokens/total_cost/logfile/flag/report)
POST /stop: End task

Benchmark flow automatically: Generate/inject flags → Start target environment → Poll results → Extract flag{...} and score → Save logs to results/logs/<timestamp>_<model>/.

🏁 Tasks & Flag Mechanism (Summary)

Register tasks in data/datasets.json, declare:
- category, environment, ports (using ${VAR:-DEFAULT} placeholders)
- flag_type: sql / file / mixed
- mixed uses flag_locations to specify multiple flag type and flag_path
Before benchmark starts automatically:
- Generate random flags → Write to flags/ and flags/flags.json
- Copy flag files to environment directory flag_path (e.g., docker-entrypoint-initdb.d/*.sql)
Port default values only replace DEFAULT in ${VAR:-DEFAULT}, FullChain class injects via environment variables without modifying compose file content.

For details and best practices, see docs/CONTRIBUTING_TASK.md.

🤝 Contributing

Refer to docs/CONTRIBUTING_TASK.md to add new tasks; ensure compose ports use placeholders and provide clear flag_path.
Welcome Issues/PRs to discuss improvements to automation scripts and dataset quality.

📖 Citation

If you use PACEbench in your research, please cite our paper:

@misc{liu2025pacebench,
      title={PACEbench: A Framework for Evaluating Practical AI Cyber-Exploitation Capabilities}, 
      author={Zicheng Liu and Lige Huang and Jie Zhang and Dongrui Liu and Yuan Tian and Jing Shao},
      year={2025},
      eprint={2510.11688},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2510.11688}, 
}

📎 Related Resources

Paper: PACEbench: A Framework for Evaluating Practical AI Cyber-Exploitation Capabilities
Project Website: https://pacebench.github.io/

If you have any questions about using or extending PACEbench, welcome to submit Issues for discussion.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🚀 Introduction

✨ Quick Start

1) Prerequisites

2) Installation

3) Common Commands

4) Model Selection

🧩 Architecture & Directory Structure

🔗 Agent Integration Protocol (Minimal Implementation)

🏁 Tasks & Flag Mechanism (Summary)

🤝 Contributing

📖 Citation

📎 Related Resources

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
config		config
data		data
docker		docker
docs		docs
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

License

RyuKosei/PACEbench

Folders and files

Latest commit

History

Repository files navigation

🚀 Introduction

✨ Quick Start

1) Prerequisites

2) Installation

3) Common Commands

4) Model Selection

🧩 Architecture & Directory Structure

🔗 Agent Integration Protocol (Minimal Implementation)

🏁 Tasks & Flag Mechanism (Summary)

🤝 Contributing

📖 Citation

📎 Related Resources

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages