This repository packages a production-oriented RNA-seq differential expression workflow that is tuned for Daylily Informatics' AWS ParallelCluster environments. It layers AWS-aware execution, caching, and reporting conventions on top of the familiar STAR ➜ feature counting ➜ DESeq2 analysis stack so teams can iterate on transcriptomics projects without reinventing infrastructure.
- Turnkey Snakemake workflow for paired-end RNA-seq, orchestrated by the `pcluster-slurm` Snakemake executor plugin to seamlessly submit jobs to AWS ParallelCluster Slurm schedulers.
- Reference and resource management helpers that cache STAR indices, Singularity images, and Conda environments on shared FSx/NFS mounts for rapid re-use across projects.
- Quality control dashboards generated with FastQC, MultiQC, and RSeQC, with example outputs captured under `docs/results_tree.log` for quick browsing of the report structure.
- Differential expression reporting via DESeq2, producing count matrices and annotated contrasts suitable for downstream visualization or knowledge bases.
- Reproducible environment bootstrapping with helper scripts (for example, `bin/install_miniconda`) and configuration templates that reduce the friction of onboarding new analysts to Daylily's RNA-seq stack.
The `docs/` directory contains reference directory trees from a representative run:
- `docs/resources_tree.log` shows the curated genome resources bundle (FASTA, GTF, and STAR genome directory) expected by the workflow.
- `docs/results_tree.log` enumerates the outputs produced during a small treated-vs-untreated comparison, including STAR alignment BAMs, read count tables, DESeq2 normalized counts, MA plots, and QC summaries.
These examples illustrate the layout teams can rely on when integrating results with downstream analytics or long-term storage policies.
- Rapid validation of new reference builds by rebuilding STAR indices on the cluster and verifying the generated QC reports against expected metrics.
- Budget-aware large-cohort processing, where the workflow's support for Slurm comments (e.g., `SMK_SLURM_COMMENT`) and job partition targeting simplifies cost tracking across Daylily's organizational projects; see the sketch after this list.
- Iterative methods development for Daylily scientists who need to test alternative quantification or normalization strategies while preserving a stable baseline pipeline for comparison.
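The cost-tracking hook mentioned above might be used as follows. This is a hedged sketch: it assumes the `pcluster-slurm` executor propagates `SMK_SLURM_COMMENT` into each submitted job's Slurm comment field, the tag value is illustrative, and whether `sacct` can report the comment depends on your Slurm accounting configuration.

```bash
# Assumption: the pcluster-slurm executor reads SMK_SLURM_COMMENT and attaches
# it to each submitted job as its Slurm comment. The tag value is illustrative.
export SMK_SLURM_COMMENT="rnaseq-projectX-2024Q3"

# After jobs complete, pull the comment back out of Slurm accounting for
# per-project cost roll-ups. The Comment field requires slurmdbd to be
# configured to store job comments (recent Slurm releases).
sacct --format=JobID,JobName%30,Partition,Elapsed,Comment%40
```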
- This workflow has been developed to run on an AWS ParallelCluster Slurm head node, specifically one created using https://github.com/Daylily-Informatics/daylily-ephemeral-cluster.
- An active AWS ParallelCluster deployment with Slurm (either a self-managed cluster or the daylily-ephemeral-cluster).
- Conda or Miniconda available on the head node. The provided `bin/install_miniconda` script can be used if Conda is not already present.
- User access to shared FSx (or comparable) storage for caching environments, containers, and references. A quick preflight check is sketched below.
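Before cloning, it can save time to confirm the basics are in place. A minimal sketch, assuming daylily-ephemeral-cluster defaults (Slurm reachable from the head node, `/fsx` as the shared mount); adjust paths for your cluster:

```bash
# Preflight checks on the head node (paths and expectations are assumptions
# based on the daylily-ephemeral-cluster defaults described above).
command -v conda >/dev/null || echo "conda not found: consider bin/install_miniconda"
sinfo --summarize      # Slurm partitions should be listed and reachable
df -h /fsx             # the shared FSx mount should be present with free space
```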
Clone the repository (it includes small example data sets):
```bash
git clone [email protected]:Daylily-Informatics/rna-seq-star-deseq2.git
cd rna-seq-star-deseq2
```

Install the Daylily Informatics fork of Snakemake that bundles AWS ParallelCluster integration, alongside the executor plugin dependencies:
```bash
conda create -n srrda -c conda-forge tabulate yaml
conda activate srrda
pip install git+https://github.com/Daylily-Informatics/[email protected]
pip install snakemake-executor-plugin-pcluster-slurm==0.0.31
pip install snakedeploy
snakemake --version  # 9.11.4.1
```

Configure the shared Snakemake output cache and scratch space (re-activating the environment in new shells):

```bash
conda activate srrda
mkdir -p /fsx/resources/environments/containers/ubuntu/rnaseq_cache/
export SNAKEMAKE_OUTPUT_CACHE=/fsx/resources/environments/containers/ubuntu/rnaseq_cache/
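# Optional persistence step (an assumption, not part of the original docs):
# append the cache location to your shell profile so fresh login shells on
# the head node keep pointing at the shared cache.
echo "export SNAKEMAKE_OUTPUT_CACHE=$SNAKEMAKE_OUTPUT_CACHE" >> ~/.bashrc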
export TMPDIR=/fsx/scratch/
```

Prepare run descriptors:
```bash
cp config/units.tsv.template config/units.tsv
[[ "$(uname)" == "Darwin" ]] && sed -i "" "s|REGSUB_PWD|$PWD|g" config/units.tsv || sed -i "s|REGSUB_PWD|$PWD|g" config/units.tsv
```

Build the Conda environments ahead of time to minimize surprises during production submissions (this can take roughly an hour):
```bash
snakemake --use-conda --use-singularity \
  --singularity-prefix /fsx/resources/environments/containers/ubuntu/ \
  --singularity-args "-B /tmp:/tmp -B /fsx:/fsx -B /home/$USER:/home/$USER -B $PWD/:$PWD" \
  --conda-prefix /fsx/resources/environments/containers/ubuntu/ \
  --executor pcluster-slurm \
  --default-resources slurm_partition=i192,i128 runtime=86400 mem_mb=36900 tmpdir=/fsx/scratch \
  --cache -p \
  -k \
  --max-threads 20000 \
  --restart-times 2 \
  --cores 20000 -j 14 -n \
  --conda-create-envs-only
```

Note: run once with `--conda-create-envs-only` to populate environments, then rerun without `-n` to execute the workflow. Setting `--max-threads` and `--cores` above your head-node CPU count works around Slurm thread-detection quirks on AWS ParallelCluster.
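Once the build pass finishes, a quick look at the shared prefix (the same path passed to `--conda-prefix` and `--singularity-prefix` above) confirms the environments and images landed where later runs expect them; the exact layout is an assumption:

```bash
# The hashed Conda environments and pulled Singularity images should appear
# under the shared prefix used above (exact subdirectory names vary).
ls /fsx/resources/environments/containers/ubuntu/
```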
Update `config/units.tsv`, `config/samples.tsv`, and `config/config.yaml` with your project-specific metadata. Then launch the workflow:
```bash
snakemake --use-conda --use-singularity \
  --singularity-prefix /fsx/resources/environments/containers/ubuntu/ \
  --singularity-args "-B /tmp:/tmp -B /fsx:/fsx -B /home/$USER:/home/$USER -B $PWD/:$PWD" \
  --conda-prefix /fsx/resources/environments/containers/ubuntu/ \
  --executor pcluster-slurm \
  --default-resources slurm_partition=i192,i128 runtime=86400 mem_mb=36900 tmpdir=/fsx/scratch \
  --cache -p \
  -k \
  --restart-times 2 \
  --max-threads 20000 \
  --cores 20000 -j 14 \
  --include-aws-benchmark-metrics
```

Monitor job states with `watch squeue` and adjust partitions (`slurm_partition=...`) to match the compute queues defined for your cluster.
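For instance, a lightly formatted monitoring loop (standard Slurm flags; the refresh interval is arbitrary):

```bash
# Refresh the queue view every 30 seconds, showing state, elapsed time, and
# the node or pending reason for each of your jobs.
watch -n 30 'squeue -u "$USER" -o "%.12i %.10P %.30j %.8T %.10M %R"'
```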
- Place sample FASTQ paths and associated metadata in `config/units.tsv` and `config/samples.tsv`.
- Review `config/config.yaml` for alignment, quantification, and contrast options.
- Perform a dry run with `-n` to validate DAG construction and resource requests before launching full-scale analyses; a minimal invocation is sketched below.
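A minimal dry-run sketch, reusing the shared prefixes from the commands above (flags beyond `-n -p` can be trimmed or extended to match your production invocation):

```bash
# A dry run resolves the DAG and prints planned jobs without submitting
# anything to Slurm; -p also echoes the underlying shell commands.
snakemake -n -p --cores 1 --use-conda --use-singularity \
  --conda-prefix /fsx/resources/environments/containers/ubuntu/ \
  --singularity-prefix /fsx/resources/environments/containers/ubuntu/
```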
This repository has evolved beyond the original public RNA-seq workflow. Previous references to that project have been removed to reduce confusion; the documentation above reflects the Daylily-specific tooling now maintained here.