A Snakemake pipeline to prepare, impute and process binary PLINK files on the Michigan/TOPMed Imputation Servers.
- Anaconda or Miniconda python
- Singularity
- Python3 (The future is now!)
- Snakemake
All other requirements will be installed locally by the pipeline at runtime.
config/config.yaml contains settings for QC and preparation, submission and download, and post-imputation processing.
Please review and edit the configuration file before starting.
By default, the pipeline looks for PLINK filesets in the base directory of your installation. You can choose a different location by editing directory. Cohort names are inferred from the stem of the PLINK filesets.
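The stem-based naming can be illustrated with a minimal sketch. It is not the pipeline's own code: the function name infer_cohorts is hypothetical, and it assumes a cohort counts only when the .bed, .bim and .fam files are all present.

```python
from pathlib import Path
import tempfile


def infer_cohorts(directory):
    """Return the sorted stems that have a complete .bed/.bim/.fam trio.

    Hypothetical helper for illustration; the pipeline's real lookup may differ.
    """
    base = Path(directory)
    required = (".bed", ".bim", ".fam")
    stems = {p.stem for p in base.glob("*.bed")}
    return sorted(
        stem for stem in stems
        if all((base / (stem + ext)).is_file() for ext in required)
    )


# Demo on a throwaway directory: study2 is skipped because it lacks .bim/.fam.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("group1.bed", "group1.bim", "group1.fam", "study2.bed"):
        (Path(tmp) / name).touch()
    demo_cohorts = infer_cohorts(tmp)

print(demo_cohorts)  # ['group1']
```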
If you do not want all filesets in the directory to be processed and imputed, then you can include COHORT: or SAMPLES: and a list of cohorts in the config file like this:
COHORT:
  - group1
  - study2
or
COHORT: [group1, study2]
The output files for the pipeline will be stored in the directory specified by out_dir. This directory must exist.
Output file choices can be specified under outputs, and you can delete or comment out any list item you do not want. The options follow:
- stat_report: A report of imputation quality
- vcf_bycohort: Bgzipped VCF files for each cohort
- vcf_merged: A bgzipped VCF with all cohorts
- bgen_bycohort: Binary Oxford filesets for each cohort
- bgen_merged: A binary Oxford fileset with all cohorts
- plink_bycohort: Binary PLINK filesets for each cohort
- plink_merged: A binary PLINK fileset with all cohorts
This pipeline can impute using the NIH (TOPMed) Imputation Server and the Michigan Imputation Server. You must have an account and API key with any server you plan to use.
Imputation settings are stored under imputation: as a hash/dictionary of cohort names, with a hash of settings underneath. You can provide default settings with the hash key default, and/or override those settings with your individual cohort names.
The hash of settings follows each server's API specification, with the addition of your API key, stored under token. The files and password settings are set automatically. Additionally, if you plan to use HRC or TOPMed and provide only the refpanel or server along with your API key, best-practice defaults will be used.
Options for server are NIH or Michigan. Case does not matter.
For each cohort specified, the pipeline overrides the defaults with that cohort's settings. Settings that match the default do not need to be repeated.
Here is an example:
imputation:
  default:
    token: token_here
    refpanel: topmed-r3
  study2:
    token: other_token_here
    server: Michigan
    refpanel: gasp-v2
    population: ASN
This will run study2 on the Michigan server using the GAsP panel with the ASN population, and all other cohorts with TOPMed using all populations.
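The override behaviour can be pictured as a dictionary merge. The sketch below is only an illustration under that assumption: resolve_settings is a hypothetical name, the pipeline's actual logic may differ, and the server- and panel-specific best-practice defaults are not modelled.

```python
def resolve_settings(imputation, cohort):
    """Merge the default settings with a cohort's overrides.

    Hypothetical helper; cohort-level keys win over keys under 'default'.
    """
    settings = dict(imputation.get("default", {}))
    settings.update(imputation.get(cohort, {}))
    # The README states that case does not matter for the server option,
    # so this sketch normalises it to lower case.
    if "server" in settings:
        settings["server"] = str(settings["server"]).lower()
    return settings


# The example configuration from above, written as a Python dict.
imputation = {
    "default": {"token": "token_here", "refpanel": "topmed-r3"},
    "study2": {
        "token": "other_token_here",
        "server": "Michigan",
        "refpanel": "gasp-v2",
        "population": "ASN",
    },
}

print(resolve_settings(imputation, "group1"))
# {'token': 'token_here', 'refpanel': 'topmed-r3'}
print(resolve_settings(imputation, "study2")["refpanel"])  # gasp-v2
```

A cohort without its own entry, such as group1 here, simply receives the default block.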
Select the chromosomes you want to upload by editing chroms. Separate ranges with ":" and listed chromosomes with ",". You can put both in the same string. Use M for mitochondria.
Options are 1:22,X,Y,M
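As an illustration of the chroms syntax, the sketch below expands such a string into a list of chromosome names. The function parse_chroms is hypothetical and is not taken from the pipeline; it assumes ranges are always numeric.

```python
def parse_chroms(spec):
    """Expand a chroms string such as '1:22,X,Y,M' into a list of names.

    Hypothetical helper: ':' marks an inclusive numeric range and ','
    separates entries, as described above.
    """
    chroms = []
    for part in str(spec).split(","):
        part = part.strip()
        if not part:
            continue
        if ":" in part:
            start, end = (int(x) for x in part.split(":"))
            chroms.extend(str(c) for c in range(start, end + 1))
        else:
            chroms.append(part.upper())
    return chroms


print(parse_chroms("1:3,X,M"))  # ['1', '2', '3', 'X', 'M']
print(len(parse_chroms("1:22,X,Y,M")))  # 25
```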
The pipeline will reference-align the input files before imputation using the fasta file specified under ref:. This fasta MUST be in the same genome build as the input. Input genome builds must match the builds supported by the chosen server, and you must specify the build under imputation if it does not match the default for the server.
This pipeline supports all builds for the downloaded imputed files. Be aware that TOPMed imputation returns GRCh38 genotypes.
The pipeline can both do QC in preparation for upload and filter the files provided by the servers on a per-cohort basis. Pre-imputation QC is under preqc: and post-imputation QC is under postqc:. The options are documented in the configuration file.
The imputation server imputes in 10kBase chunks and removes any chunk in which a subject has a call rate below 50%. To avoid this, the pipeline can remove subjects who violate one of two criteria: either no chromosome may have more than 20% missingness (chr_callrate: True), or no imputation chunk with at least 50 variants may have more than 50% missingness. You can skip both checks by setting both options to false, but you can enable at most one of them. The chunk check is recommended.
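To make the chunk criterion concrete, here is a rough sketch of how subjects failing it might be identified. It is an assumption-laden illustration, not the pipeline's implementation: the function name, the in-memory data layout (a list of variant positions plus per-sample missingness flags) and the fixed-width chunking by position are all hypothetical, and the chunk size is left as a parameter.

```python
from collections import defaultdict


def samples_failing_chunk_check(positions, missing, chunk_size,
                                min_variants=50, max_missing=0.5):
    """Return samples whose missingness exceeds max_missing in any chunk
    that holds at least min_variants variants.

    positions: variant positions on one chromosome (hypothetical layout)
    missing:   dict of sample ID -> list of booleans aligned to positions,
               True where the genotype is missing
    """
    # Group variant indices into fixed-width chunks by position.
    chunks = defaultdict(list)
    for idx, pos in enumerate(positions):
        chunks[pos // chunk_size].append(idx)

    failing = set()
    for indices in chunks.values():
        if len(indices) < min_variants:
            continue  # sparse chunks are not tested
        for sample, flags in missing.items():
            n_missing = sum(1 for i in indices if flags[i])
            if n_missing / len(indices) > max_missing:
                failing.add(sample)
    return sorted(failing)


# Toy data: 100 variants split into two chunks of 50 variants each.
positions = list(range(100))
missing = {
    "s1": [False] * 100,                               # complete data
    "s2": [True] * 30 + [False] * 70,                  # 60% missing in chunk 0
    "s3": [False] * 50 + [True] * 26 + [False] * 24,   # 52% missing in chunk 1
}
print(samples_failing_chunk_check(positions, missing, chunk_size=50))
# ['s2', 's3']
```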
You can also include OR exclude samples by a named list file using one of the following options:
- include_samp: filename of samples to include (plink --keep)
- exclude_samp: filename of samples to exclude (plink --remove)
include_samp accepts a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column, and removes all unlisted samples from the current analysis. exclude_samp accepts the same format and removes all listed samples.
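A sample list in this two-column format can be produced with a few lines of Python. The sketch below is only an example: the helper names, the file name keep.txt and the sample IDs are hypothetical.

```python
from pathlib import Path
import tempfile


def write_sample_list(path, samples):
    """Write (family ID, within-family ID) pairs, one tab-delimited pair per line."""
    lines = [f"{fid}\t{iid}" for fid, iid in samples]
    Path(path).write_text("\n".join(lines) + "\n")


def read_sample_list(path):
    """Read the first two whitespace-delimited columns of each non-empty line."""
    pairs = []
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


# Round trip through a temporary file with made-up IDs.
with tempfile.TemporaryDirectory() as tmp:
    keep_file = Path(tmp) / "keep.txt"
    write_sample_list(keep_file, [("FAM1", "IND1"), ("FAM2", "IND2")])
    demo_pairs = read_sample_list(keep_file)

print(demo_pairs)  # [('FAM1', 'IND1'), ('FAM2', 'IND2')]
```

The resulting file path would then be given as the value of include_samp or exclude_samp.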
You must run the pipeline with Anaconda environments and Singularity enabled. You can either use a Snakemake profile to do so or run the pipeline with the following command:
snakemake --use-conda --use-singularity
Make sure your cluster supports the amount of memory required for report generation (528 GiB) and that the memory is being properly requested. If that is not the case, you can edit the resource requirements on the rule stats in the postImpute pipeline and modify the modularization lines in workflow/Snakefile in this pipeline. We have included an lsf.yaml file that ensures those resources are available within this pipeline.