This work can be used to model ensembles of pMHC-I molecules with peptides of lengths 8-13 amino acids.
Note: The following instructions are for generating structural ensembles given inferences from the ML component of P-flex
1. Navigate to the pmhc-flex folder after downloading it from GitLab.
2. From within the pmhc-flex folder, run the install script:
chmod +x ./scripts/install.sh
./scripts/install.sh
The install script will clone mhc-pep-threader, download clustalo, and set up the appropriate Python paths. Please double-check that the paths are set up correctly.
3. Download and install PyRosetta through the wheel file from https://www.pyrosetta.org/downloads.
4. If the installation proceeds without issues, all the libraries needed to run your first simulation should be installed.
5. Download the database folder from Zenodo and place it under the pmhc-flex folder.
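After the install steps, a quick check along the following lines can confirm that the main dependencies are visible. This is a minimal sketch and not part of the repository: the helper name `check_environment` is hypothetical, and it assumes that PyRosetta is importable as `pyrosetta` and that the Clustal Omega binary is called `clustalo` and is discoverable on `PATH`. If the install script places the binary elsewhere, the `clustal_path` config entry is the place to point at it.

```python
import importlib.util
import shutil


def check_environment(modules=("pyrosetta",), binaries=("clustalo",)):
    """Report which Python modules and executables can be found.

    Hypothetical helper for illustration; the default names are assumptions.
    """
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        report[f"binary:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for item, found in check_environment().items():
        print(f"{item}: {'found' if found else 'MISSING'}")
```

A `MISSING` entry for `clustalo` does not necessarily mean the install failed; it may only mean the binary is not on `PATH`.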
If you want to run benchmarks, please download the md_analyses folder from Zenodo and place it inside the pmhc-flex directory. The example folder holds the following input files, under the short and long directories named in the run descriptions below:
- unk.csv: file containing the peptide and MHC sequences for which we want to generate ensembles
- test.npy: file containing the inferences for the pMHC sequences in unk.csv
- test.txt: file containing the PDBIDs to run benchmarks on
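Before launching a run, it can help to confirm that an input directory actually holds the three files listed above. The snippet below is a small sketch under stated assumptions: the helper name `missing_example_files` is hypothetical, and the directory passed in (for example `short`) is whatever location holds your copies of the inputs.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = ("unk.csv", "test.npy", "test.txt")


def missing_example_files(example_dir):
    """Return the expected file names that are absent from example_dir."""
    root = Path(example_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all three inputs are present. Note that test.txt is only needed for benchmark runs.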
The folders inside outputs contain the runs we performed for the example cases:
- short_unk_cst - generate ensembles for the short peptides in short/unk.csv, where the examples are treated as an unknown pMHC pair. Constraint-based templates are used as starting models.
- short_unk_seq - generate ensembles for the short peptides in short/unk.csv, where the examples are treated as an unknown pMHC pair. Sequence-based templates are used as starting models.
- short_benchmark_cst - generate ensembles for the short peptides in short/test.txt, where the examples are treated as an unknown pMHC pair. Constraint-based templates are used as starting models.
- short_benchmark_seq - generate ensembles for the short peptides in short/test.txt, where the examples are treated as an unknown pMHC pair. Sequence-based templates are used as starting models.
- long_unk - generate ensembles for the longer peptides in long/unk.csv, where the examples are treated as an unknown pMHC pair. Multiple templates are used as starting models.
Explanation of output files:
- Inside the LLFGYPVYV_A02:01* folder, you will find a subfolder (6PTE_reordered) named after the template used. For instance, for this peptide, 6PTE was used as the starting model.
- Within 6PTE_reordered, you will find:
a. Inputs to mhc-pep-threader: mhc_list and pep_list.
b. The run command used to invoke mhc-pep-threader.
c. Outputs produced by mhc-pep-threader: the cleaned and trimmed templates for homology modeling (6PTE_reordered*.pdb), clustal input and output files, Rosetta alignment (grishin) and restricted relax (movemap) files, the homology model, the relaxed homology model, and the binding energy of the relaxed model.
d. FPD models, identified by the substring "dock_refined" in the pdb file name.
- Inside benchmark runs, we have additional analysis files:
a. _scores.csv: a csv file containing Rosetta scores and a comparison with native scores.
b. md_model_comparison.tsv: a tsv file containing n (FPD models) x 300 (MD samples) comparison scores.
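Since the FPD models are identified by the "dock_refined" substring in their file names, collecting them from an output folder can be sketched as follows. The function name `find_fpd_models` is hypothetical, and the recursive search is an assumption about how deeply the models are nested under a run directory.

```python
from pathlib import Path


def find_fpd_models(run_dir, marker="dock_refined"):
    """Return sorted PDB paths under run_dir whose file name contains marker."""
    root = Path(run_dir)
    # rglob walks all subfolders, so template subfolders are searched as well
    return sorted(path for path in root.rglob("*.pdb") if marker in path.name)
```

For example, `find_fpd_models("outputs/short_unk_cst")` would list the refined models of that run while skipping templates and homology models, provided the outputs follow the naming described above.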
This codebase uses Hydra, so the parameters are easy to update. All the config files are in pmhc_flex/src/gen_ensemble/hydra_config. You can update the parameters to suit your use case. Below are explanations of the parameters in each of the Hydra yaml files:
- orchestrator/default.yaml
# initialize parameters
initialize:
distance_ctype - function used to adjust the energy landscape based on distance constraints; default is FLAT_HARMONIC
dihedral_ctype - function used to adjust the energy landscape based on dihedral constraints; default is CIRCULAR_HARMONIC
cst_based_template - select templates based on predicted constraints
benchmarks:
run_benchmark - set this flag to true if you want to run benchmarks
run:
unk - set this flag to true if the pMHC you want to model is not a benchmark (or an unknown sequence for which you want to generate models)
pred_cst - set this flag to true if you want to generate ensembles using constraints (in this version of the code, we do not support running or generating ensembles without using any constraints)
unk_sequences - path to sequence list. See examples to learn the formats
testset - path to PDBID list for benchmarks. See examples to learn the formats
inferred_csts - path to inferred csts from ML component. See examples to learn the formats (the npy files are output by the ML component, so we do not need to edit this file)
output:
outdir - path to output dir, will be automatically updated inside the code.
fasta - path to PDBS fasta file containing peptide and MHC sequences in the database/lib
dir - path to PDB files in database/lib/TRIMMED_TEMPLATES
multi_template_dir - path to multiple templates in database/lib/multi_templates for longer peptides
md:
compare_with_native - compare MD frames with native structures; default is false since these comparisons are precomputed and made available in the md_analyses folder
dir - path to MD frames
rms_to_native - path to file containing comparison metrics between MD frames and natives
- homology_model/default.yaml
path_to_bin - path to mhc-pep-threader
nstruct - number of structures to relax; default to 1
mhcs - list of MHC alleles to model for each peptide; see examples for formats
peptides - list of peptides to model; see examples for formats
pep_start_index - start index of the peptide; default to 181
interface_cutpoint - end index of the MHC; default to 180
relax_after_threading - refine models after threading; default to true
template_file - template file used for homology modeling; will be automatically updated by the orchestrator
template_pdb - template pdbid used for homology modeling; will be automatically updated by the orchestrator
mhc_chain - mhc chain ID of the template pdb; will be automatically updated by the orchestrator
peptide_chain - peptide chain ID of the template pdb; will be automatically updated by the orchestrator
outdir - path to the output directory where homology models are placed; will be automatically updated by the orchestrator
allele - input allele; will be automatically updated by the orchestrator
peptide - input peptide; will be automatically updated by the orchestrator
clustal_path - path to the clustal omega binary
- flexpepdock/default.yaml
receptor_chain - receptor (HLA) chain; default to chain A
peptide_chain - peptide chain; default to chain B
pep_refine - run fpd refine; default to true
nstruct - number of models to output; default to 3 for testing purposes. Please increase this number when sampling more conformers
suffix - suffix for models; default to _dock_refined
cst_fa_weight - full-atom constraint weights, default to 0.1
cst_weight - constraint weights, default to 0.1
input_model - input model; will be automatically updated by the orchestrator
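Besides editing the yaml files, Hydra applications generally accept `key=value` overrides on the command line. The sketch below only assembles such a command; it does not run anything. The helper name `build_command` is hypothetical, and the dotted key paths in the usage example (`flexpepdock.nstruct`, `orchestrator.run.unk`) are assumptions about how the config groups are composed, so check them against the yaml files before relying on them.

```python
import shlex


def build_command(overrides, script="../src/gen_ensemble/orchestrator/orchestrate.py"):
    """Build an argv list with Hydra-style key=value overrides appended."""
    argv = ["python", script]
    for key, value in overrides.items():
        # Booleans are written in lowercase here, matching the yaml style
        if isinstance(value, bool):
            value = str(value).lower()
        argv.append(f"{key}={value}")
    return argv


if __name__ == "__main__":
    # Key paths below are illustrative assumptions, not confirmed config paths
    cmd = build_command({"flexpepdock.nstruct": 50, "orchestrator.run.unk": True})
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.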
From within the examples folder, run the following command:
python ../src/gen_ensemble/orchestrator/orchestrate.py
- In the paper, the benchmark was done by tuning the FPD temperature directly in the Rosetta source code, because no user-facing parameter exposes the temperature. If there is sufficient interest, please contact us and we can assist in updating the source to change the temperature from 0.2 to 0.8.
- If you want to cluster FPD and MD ensembles, you can invoke cluster_pipeline by placing the FPD and MD models together in one directory and passing that directory as input.
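Placing the FPD and MD models together in one directory can be sketched as below. The helper `stage_models` is hypothetical and stops at preparing the directory, because the signature of cluster_pipeline is not described here. The `fpd_` and `md_` file name prefixes are an assumption added to avoid name collisions; confirm that cluster_pipeline has no naming requirements before using them.

```python
import shutil
from pathlib import Path


def stage_models(fpd_dir, md_dir, combined_dir):
    """Copy PDB files from both ensembles into one directory.

    Files are prefixed with 'fpd_' or 'md_' so that equal names do not
    collide and the origin of each model stays visible afterwards.
    """
    out = Path(combined_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for prefix, source in (("fpd", fpd_dir), ("md", md_dir)):
        for pdb in sorted(Path(source).glob("*.pdb")):
            target = out / f"{prefix}_{pdb.name}"
            shutil.copy2(pdb, target)
            copied.append(target)
    return copied
```

The resulting directory can then be passed to cluster_pipeline as its input.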
You can also update the database by running the setup_db.sh script from within pmhc_flex/scripts.