Learning Protein Conformation and Dynamics through Autoregression

Introduction

Proteins are flexible molecules, and capturing their dynamics is essential for understanding their functions. Molecular dynamics (MD) simulations model these dynamics by sampling conformational ensembles from an underlying energy landscape defined by physical force fields, providing rich data for studying protein conformational behavior.

An illustration of protein dynamics generated by MD simulation (left), by sampling from the energy landscape defined by a physical model (right).

ConfRover is a deep generative model that learns to produce dynamic protein trajectories directly from MD data. It generates conformational frames in a trajectory autoregressively, sampling each next frame conditioned on historical context. Building on a causal transformer architecture widely used in language models, ConfRover enables efficient training and jointly learns the distribution of protein conformations and their temporal evolution at coarse time steps, offering a fast proxy for expensive MD simulations.

By modeling different dependency patterns, ConfRover supports various tasks:

Tasks	Input condition
Forward simulation	amino acid sequence $\mathbf{s}$, starting frame $\mathbf{x}_1$, number of frames $L$, time interval (stride) $\Delta t$
IID sampling	amino acid sequence $\mathbf{s}$, number of samples $N$
State interpolation	amino acid sequence $\mathbf{s}$, starting frame $\mathbf{x}_1$, ending frame $\mathbf{x}_2$, number of frames $L$, time interval (stride) $\Delta t$

See our paper and website for more details.

(A) ConfRover unifies the learning of conformational distributions and their temporal evolution through frame-level conditioning, enabling multiple tasks. (B) Its general autoregressive formulation captures temporal dependencies directly from data using a causal transformer.

Updates

[2025-11] ConfRover v1.0 released!
[2025-09] ConfRover is accepted to NeurIPS 2025!

🔔 Stay Updated

ConfRover is under active development. Click Watch to follow new updates and releases.

Pretrained models

We provide a list of pretrained models from the ConfRover family:

Model	Best for	Downloaded Checkpoints
`ConfRover-base-20M-v1.0`	Forward simulation and IID sampling
`ConfRover-interp-20M-v1.0`	State interpolation

Pretrained model checkpoints can also be downloaded through ConfRover.from_pretrained(model_name).

Quick Start

Installation

# [recommended] use conda environment
conda create -n confrover python=3.10
conda activate confrover

# clone ConfRover repository
git clone https://github.com/ByteDance-Seed/ConfRover.git
cd ConfRover

# first install confrover and other dependencies, then openfold (requires torch pre-installed)
pip install . && pip install --no-build-isolation .[openfold]

ConfRover has been tested on an NVIDIA H100 GPU with CUDA 12.6.

Python API

ConfRover is installed as a Python package and provides a simple API for generating conformations or trajectories for a single case. See the following snippet and the examples/ folder:

from confrover.model import ConfRover

# Load pretrained model
model = ConfRover.from_pretrained("ConfRover-base-20M-v1.0") # see method for optional arguments

# Move to GPU
model.to("cuda:0")

# Task 1: forward simulation
model.generate(
    case_id="6j56_A",
    seqres="ARQREIEMNRQQRFFRIPFIRPADQYKDPQSKKKGWWYAHFDGPWIARQMELHPDKPPILLVAGKDDMEMCELNLEETGLTRKRGAEILPRQFEEIWERCGGIQYLQNAIESRQARPTYATAMLQSLLK",
    task_mode="forward",
    output_dir="/path/to/output/fwd/",
    n_replicates=1,
    n_frames=10, # total number of frames (including the starting frame)
    stride_in_10ps=256, # time interval between frames in the unit of 10 ps.
    conditions="/path/to/examples/6j56_A_start.pdb", # start frame
)

# Task 2: Independent ensemble sampling
model.generate(
    case_id="6j56_A",
    seqres="ARQREIEMNRQQRFFRIPFIRPADQYKDPQSKKKGWWYAHFDGPWIARQMELHPDKPPILLVAGKDDMEMCELNLEETGLTRKRGAEILPRQFEEIWERCGGIQYLQNAIESRQARPTYATAMLQSLLK",
    task_mode="iid",    
    output_dir="/path/to/output/iid/",
    n_replicates=50, # number of conformation samples
)

# Task 3: interpolating two conformations
model.generate(
    case_id="6j56_A",
    seqres="ARQREIEMNRQQRFFRIPFIRPADQYKDPQSKKKGWWYAHFDGPWIARQMELHPDKPPILLVAGKDDMEMCELNLEETGLTRKRGAEILPRQFEEIWERCGGIQYLQNAIESRQARPTYATAMLQSLLK",
    task_mode="interp",
    output_dir="/path/to/output/interp/",
    n_replicates=5,
    n_frames=9,
    stride_in_10ps=256,
    conditions = [
        "/path/to/examples/6j56_A_start.pdb",
        "/path/to/examples/6j56_A_end.pdb",
    ],
)

The `ConfRover.generate()` method is designed for simple runs or for integrating ConfRover into customized pipelines. For batch generation, we recommend using the command line interface.
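
As a quick sanity check on the `n_frames` and `stride_in_10ps` arguments used in the snippet above, the simulated time covered by one trajectory follows from simple arithmetic. The sketch below assumes, as the comments in the snippet state, that `n_frames` counts the starting frame and that consecutive frames are `stride_in_10ps` × 10 ps apart. The helper name `trajectory_span_ns` is ours for illustration and is not part of the ConfRover API.

```python
def trajectory_span_ns(n_frames: int, stride_in_10ps: int) -> float:
    """Simulated time between the first and last frame, in nanoseconds.

    Hypothetical helper (not part of ConfRover). Assumes n_frames includes
    the starting frame, so a trajectory has n_frames - 1 intervals.
    """
    if n_frames < 1 or stride_in_10ps < 1:
        raise ValueError("n_frames and stride_in_10ps must be positive")
    interval_ps = stride_in_10ps * 10  # stride is given in units of 10 ps
    return (n_frames - 1) * interval_ps / 1000.0  # 1 ns = 1000 ps


# Forward-simulation example above: 10 frames at stride 256,
# i.e. 9 intervals of 2.56 ns each.
print(trajectory_span_ns(10, 256))  # 23.04
```

With the values from Task 1 this gives roughly 23 ns per trajectory, which is within the 100 ns trajectory length of the training data mentioned under Limitations.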

Command line interface

ConfRover provides a command line interface for parallel generation over multiple GPUs. A `.json` manifest file is required to specify the generation tasks and cases.

confrover generate \
    --job_config <path/to/job_manifest.json> \
    --output <path/to/output_dir> \
    --model <model_name/weight_path> \
    [...] 

# See `confrover generate --help` for detailed arguments.

Input manifest format

ConfRover uses JSON files to define generation tasks for forward simulation, interpolation, and IID sampling. The file specifies basic job information and a list of cases, each describing the protein name, amino acid sequence, and optional conditioning frames for trajectory generation. Conditioning frames can be provided as conformations in .pdb files or as specific frames in an .xtc trajectory file (selected by frame index). The following examples show how to define each type of generation job.

Forward simulation: generate protein motion trajectories from an initial conformation ("condition") at a specified stride.

{
    "name": "job_name", 
    "task_mode": "forward",
    "n_replicates": 1, // <int> number of replicated trajectories for each case.
    "n_frames": 100, // <int> number of frames in each generated trajectory (including the conditioning frame).
    "stride_in_10ps": 120, // <int> interval between frames in the unit of 10 ps.
    "cases": [
        // Option 1: starting from a single .pdb file
        {
            "case_id": "7jfl_C", // case_id must be unique
            "seqres": "SALQDLLRTLKSPSSPQQQQQVLNILKSNPQLMAAFIKQRTAKYVAN", // amino acid sequence
            "conditions": "/path/to/7jfl_C.pdb" // <str> path to the starting .pdb file
        },
        // Option 2: starting from a time frame defined in a .xtc file
        {
            "case_id": "7lp1_A",
            "seqres": "VTQSFLPPGWEMRIAPNGRPFFIDHNTKTTTWEDPRLKF",
            "conditions": {
                "xtc_fpath": "/path/to/7lp1_A.xtc", // <str> .xtc file contains trajectory information
                "pdb_fpath": "/path/to/7lp1_A.pdb", // <str> corresponding .pdb file containing the molecule topology
                "frame_idxs": 1000 // <int> time frame index in trajectory to start from 
            }
        },
        ...
    ]
}
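
For many cases it can be easier to generate the manifest from a script than to write it by hand. The following is a minimal sketch using only the Python standard library; the function name `make_forward_manifest` and the output file name are our own choices, and the keys simply mirror the forward-simulation example above. The file it writes contains no `//` comments, which sidesteps the question of whether ConfRover's loader accepts them (standard JSON, and Python's `json` module, do not).

```python
import json
import tempfile
from pathlib import Path


def make_forward_manifest(name, cases, n_frames, stride_in_10ps, n_replicates=1):
    """Assemble a forward-simulation manifest as a plain dict.

    Hypothetical helper (not part of ConfRover). `cases` is a list of
    (case_id, seqres, start_pdb_path) tuples.
    """
    case_ids = [case_id for case_id, _, _ in cases]
    if len(set(case_ids)) != len(case_ids):
        raise ValueError("case_id must be unique")
    return {
        "name": name,
        "task_mode": "forward",
        "n_replicates": n_replicates,
        "n_frames": n_frames,
        "stride_in_10ps": stride_in_10ps,
        "cases": [
            {"case_id": case_id, "seqres": seqres, "conditions": str(pdb_path)}
            for case_id, seqres, pdb_path in cases
        ],
    }


manifest = make_forward_manifest(
    name="job_name",
    cases=[
        ("7jfl_C", "SALQDLLRTLKSPSSPQQQQQVLNILKSNPQLMAAFIKQRTAKYVAN", "/path/to/7jfl_C.pdb"),
    ],
    n_frames=100,
    stride_in_10ps=120,
)

# Written to the system temp directory here; choose any location you like.
out_path = Path(tempfile.gettempdir()) / "forward_manifest.json"
out_path.write_text(json.dumps(manifest, indent=4))
print(out_path)
```

The resulting file can then be passed to `confrover generate` through `--job_config`.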

Independent ensemble sampling: directly sample independent conformations.

{
    "name": "job_name", 
    "task_mode": "iid",
    "n_replicates": 500, // <int> number of conformation samples
    "cases": [
        {
            "case_id": "7jfl_C",
            "seqres": "SALQDLLRTLKSPSSPQQQQQVLNILKSNPQLMAAFIKQRTAKYVAN"
            // iid sampling does not need conditioning frames
        },
        ...
    ]
}

Conformation interpolation: generate interpolating trajectories between two specified conformations ("conditions") with a specified trajectory length and stride.

{
    "name": "job_name", 
    "task_mode": "interp",
    "n_replicates": 1, // <int> number of replicated trajectories for each case.
    "n_frames": 10, // <int> number of frames in each generated trajectory (including the conditioning frames).
    "stride_in_10ps": 120, // <int> interval between frames in the unit of 10 ps.
    "cases": [
        // Option 1: use a pair of .pdb files as start/end conditions
        {
            "case_id": "7jfl_C",
            "seqres": "SALQDLLRTLKSPSSPQQQQQVLNILKSNPQLMAAFIKQRTAKYVAN",
            "conditions": [
              "/path/to/7jfl_C_start.pdb",  // <str> path to the starting .pdb file
              "/path/to/7jfl_C_end.pdb"     // <str> path to the ending .pdb file
            ]
        },
        // Option 2: using two time frames defined in a .xtc file as start/end conditions
        {
            "case_id": "7lp1_A",
            "seqres": "VTQSFLPPGWEMRIAPNGRPFFIDHNTKTTTWEDPRLKF",
            "conditions": {
                "xtc_fpath": "/path/to/7lp1_A.xtc", // <str> .xtc file contains trajectory information
                "pdb_fpath": "/path/to/7lp1_A.pdb", // <str> corresponding .pdb file containing the molecule topology
                "frame_idxs": [1000, 3000] // <list[int]> a pair of time frame indices in the trajectory to use as start/end conditions
            }
        },
        ...
    ]
}
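
The three formats above differ mainly in how many conditioning frames each case supplies: one for forward simulation, two for interpolation, and none for IID sampling. Before launching a long job it may be worth checking a manifest against these rules. The sketch below is our own illustration of such a check, based only on the examples in this section; it is not ConfRover's internal validation, and the function name `check_manifest` is hypothetical.

```python
def check_manifest(manifest: dict) -> list:
    """Return a list of human-readable problems (empty if none were found).

    Hypothetical pre-flight check, derived from the manifest examples above.
    """
    mode = manifest.get("task_mode")
    if mode not in ("forward", "iid", "interp"):
        return [f"unknown task_mode: {mode!r}"]

    problems = []
    if mode != "iid":
        # Trajectory tasks need a length and a stride.
        for key in ("n_frames", "stride_in_10ps"):
            if not isinstance(manifest.get(key), int):
                problems.append(f"{key} must be an integer when task_mode is {mode!r}")

    expected = {"forward": 1, "interp": 2}.get(mode)  # None for iid
    seen = set()
    for case in manifest.get("cases", []):
        case_id = case.get("case_id")
        if case_id in seen:
            problems.append(f"duplicate case_id: {case_id!r}")
        seen.add(case_id)
        if not case.get("seqres"):
            problems.append(f"{case_id!r}: missing seqres")
        if expected is None:
            continue  # iid sampling does not need conditioning frames

        cond = case.get("conditions")
        if isinstance(cond, dict):  # frames selected from an .xtc trajectory
            cond = cond.get("frame_idxs")
        if cond is None:
            found = 0
        elif isinstance(cond, list):
            found = len(cond)
        else:
            found = 1  # a single path or a single frame index
        if found != expected:
            problems.append(
                f"{case_id!r}: expected {expected} conditioning frame(s), found {found}"
            )
    return problems


# An interpolation job that mistakenly gives a single frame index.
example = {
    "name": "job_name",
    "task_mode": "interp",
    "n_replicates": 1,
    "n_frames": 10,
    "stride_in_10ps": 120,
    "cases": [
        {
            "case_id": "7lp1_A",
            "seqres": "VTQSFLPPGWEMRIAPNGRPFFIDHNTKTTTWEDPRLKF",
            "conditions": {
                "xtc_fpath": "/path/to/7lp1_A.xtc",
                "pdb_fpath": "/path/to/7lp1_A.pdb",
                "frame_idxs": 1000,
            },
        }
    ],
}
for problem in check_manifest(example):
    print(problem)
```

The check is deliberately shallow: it does not open the .pdb or .xtc files, nor does it verify that `seqres` matches the structure.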

Result structure

ConfRover saves the generation results of each job under a <job_name>/ directory inside the output directory, with each case in its own subdirectory and replicates suffixed with _sample<idx>. By default, ConfRover saves the dense trajectory in .xtc format and a sparsely sampled trajectory (e.g., 20 frames) in .pdb format for preview. Metadata for each run is saved in .info files.

An example output folder structure:

job_name/
├── case_id_1/
│   ├── case_id_1_sample0.xtc   # xtc trajectory file
│   ├── case_id_1_sample0.pdb   # pdb topology file
│   ├── case_id_1_sample0_preview.pdb   # pdb file contains sampled conformations for preview
│   ├── case_id_1_sample0.info  # json format metadata for the run
│   ├── case_id_1_sample1.xtc
│   ├── case_id_1_sample1.pdb
│   ├── case_id_1_sample1_preview.pdb
│   ├── case_id_1_sample1.info
│   └── ...
├── case_id_2/
│   └── ...
└── ...
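
For downstream analysis, the layout above can be walked with the standard library alone. The following sketch groups the files of each replicate by case and parses the .info metadata as JSON. It assumes the file naming shown in the example tree; the function name `collect_samples` and the record keys are our own, and the contents of the .info files are not specified here, so they are returned as-is.

```python
import json
from pathlib import Path


def collect_samples(job_dir):
    """Map case_id -> list of per-replicate records, sorted by sample index.

    Hypothetical helper (not part of ConfRover). Assumes files are named
    <case_id>_sample<idx>.{xtc,pdb,info} as in the example tree above.
    """
    results = {}
    for case_dir in sorted(p for p in Path(job_dir).iterdir() if p.is_dir()):
        samples = []
        for xtc in case_dir.glob("*_sample*.xtc"):
            stem = xtc.stem  # e.g. "case_id_1_sample0"
            index = int(stem.rsplit("_sample", 1)[1])
            info_path = xtc.with_suffix(".info")
            samples.append({
                "index": index,
                "xtc": xtc,                                # dense trajectory
                "topology": xtc.with_suffix(".pdb"),       # topology for the .xtc
                "preview": case_dir / f"{stem}_preview.pdb",
                "info": json.loads(info_path.read_text()) if info_path.exists() else None,
            })
        results[case_dir.name] = sorted(samples, key=lambda s: s["index"])
    return results


job_dir = Path("/path/to/output_dir/job_name")  # placeholder path
if job_dir.is_dir():
    for case_id, samples in collect_samples(job_dir).items():
        print(case_id, [s["index"] for s in samples])
```

Each record pairs an .xtc file with its topology .pdb, which is what trajectory libraries typically need; with MDTraj, for example, `mdtraj.load(str(s["xtc"]), top=str(s["topology"]))` should load a replicate.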

Intermediate cache assets

ConfRover uses state-of-the-art folding models to extract protein-level representations as input, and caches the MSAs and protein representations so that they can be reused across runs. By default, these intermediate assets and the model weights are saved to $(pwd)/confrover_cache. Use the --cache_dir argument to specify a different cache location. See --help for more details.

Limitations

Protein-only, single-chain. Current ConfRover models support only proteins and assume a single polypeptide chain.
Out-of-scope use. ConfRover-v1.0 is trained mainly on the ATLAS dataset with 100 ns trajectories, which may restrict learned dynamics to short-timescale, local motions.
Backbone-focused diffusion. Diffusion operates on backbone SE(3) space, with side chains reconstructed through predicted torsional angles, which may reduce accuracy for large rotamer changes.

License

ConfRover code and model weights are licensed under the Apache-2.0 License.

Questions and Issues

Please feel free to reach out to us or open an issue if you encounter any problems or have any questions.

Contributing to ConfRover

We welcome contributions from the community to further improve ConfRover! Please check Contributing for more details.

Code of Conduct

We are committed to creating a safe and inclusive environment for all contributors. Please review our Code of Conduct for more details.

Security

If you discover a potential security issue in this project, or think you may have discovered a security issue, we ask that you notify Bytedance Security via our security center or vulnerability reporting email.

Please do not create a public GitHub issue.

Acknowledgements

ConfRover builds on prior open source work, with components adapted from ColabFold, OpenFold, Ligo-Biosciences, and SE3-Diffusion. We gratefully acknowledge these contributions.

Citing ConfRover

If you find ConfRover useful in your research, please cite the following paper:

@article{confrover2025,
  title={Simultaneous Modeling of Protein Conformation and Dynamics via Autoregression},
  author={Shen, Yuning and Wang, Lihao and Yuan, Huizhuo and Wang, Yan and Yang, Bangji and Gu, Quanquan},
  journal={arXiv preprint arXiv:2505.17478},
  year={2025}
}
