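This is the first GATK variant-calling workflow I learned: short variants in the human germline, covering SNPs and INDELs. I wrote a SnakeMake analysis workflow that goes from FASTQ files to the final VEP-annotated VCF file. For an introduction to VCF, see the previous post, "Gene sequence variation information: VCF (Variant Call Format)".
The workflow code is at https://jihulab.com/BioQuest/smkhgs or https://github.com/BioQuestX/smkhgs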
README
GATK best practices workflow
Pipeline summary
SnakeMake workflow for Human Germline short variants (SNP+INDEL)
Reference
- Reference genome related files and GATK bundle files (GATK)
- VEP variation annotation files (VEP)
Prepare
- Adapter trimming (Fastp)
- Aligner (BWA mem2)
- Mark duplicates (samblaster)
- Generate a recalibration table for Base Quality Score Recalibration (BaseRecalibrator)
- Apply base quality score recalibration (ApplyBQSR)
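The two BQSR steps at the end of this list can be sketched as a pair of GATK command lines. This is a minimal Python sketch, not the repository's actual rules: the helper name `bqsr_commands` and every file name (`*.dedup.bam`, `*.recal.table`, `*.recal.bam`, the reference and known-sites paths) are assumptions made for illustration. Only the GATK tool names and their `-R`, `-I`, `-O`, `--known-sites` and `--bqsr-recal-file` arguments come from GATK itself.

```python
import shlex


def bqsr_commands(sample, ref_fasta, known_sites, in_dir="results/prepared"):
    """Return the two GATK command lines (as argv lists) for BQSR on one sample.

    All paths are hypothetical placeholders, not the repository's real layout.
    """
    dedup_bam = f"{in_dir}/{sample}.dedup.bam"      # assumed duplicate-marked BAM
    recal_table = f"{in_dir}/{sample}.recal.table"  # assumed table name
    recal_bam = f"{in_dir}/{sample}.recal.bam"      # assumed output name

    # Step 1: build the recalibration table from known variant sites.
    base_recalibrator = ["gatk", "BaseRecalibrator",
                         "-R", ref_fasta, "-I", dedup_bam]
    for vcf in known_sites:
        base_recalibrator += ["--known-sites", vcf]
    base_recalibrator += ["-O", recal_table]

    # Step 2: write a new BAM whose base qualities use that table.
    apply_bqsr = ["gatk", "ApplyBQSR", "-R", ref_fasta, "-I", dedup_bam,
                  "--bqsr-recal-file", recal_table, "-O", recal_bam]
    return [base_recalibrator, apply_bqsr]


if __name__ == "__main__":
    for argv in bqsr_commands("SRR24443168", "ref/genome.fa",
                              ["ref/dbsnp.vcf.gz"]):
        print(shlex.join(argv))
```

Returning argv lists rather than one shell string keeps quoting out of the picture; in a Snakemake rule the same arguments would normally sit in a `shell:` block with `{input}` and `{output}` placeholders.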
Quality control report
- Fastp report (MultiQC)
- Alignment report (MultiQC)
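The two reports above could each come from one MultiQC call. The sketch below is an assumption about how that might be wired, not the workflow's real rule: `multiqc_command` is a hypothetical helper and the search directories are placeholders. `--filename`, `--outdir` and `--force` are existing MultiQC options, and naming the report `fastp_multiqc.html` is consistent with the `fastp_multiqc_data` directory shown in the output tree.

```python
def multiqc_command(search_dir, report_name, out_dir="report"):
    """Build a MultiQC command line that scans search_dir for tool logs.

    search_dir is a hypothetical location; the repository may keep
    its fastp and alignment statistics elsewhere.
    """
    return ["multiqc", search_dir,
            "--filename", report_name,  # report file name inside out_dir
            "--outdir", out_dir,
            "--force"]                  # overwrite an existing report


if __name__ == "__main__":
    print(" ".join(multiqc_command("results/trimmed", "fastp_multiqc.html")))
    print(" ".join(multiqc_command("results/prepared", "prepare_multiqc.html")))
```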
Call
- Call germline SNPs and indels via local re-assembly of haplotypes (HaplotypeCaller)
- Import VCFs to GenomicsDB (GenomicsDBImport)
- Perform joint genotyping on one or more samples pre-called with HaplotypeCaller (GenotypeGVCFs)
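The three calling steps above form a chain: one GVCF per sample, one GenomicsDB workspace for all GVCFs, then joint genotyping from that workspace. A rough Python sketch of the command lines follows. The helper name, the file names and the use of `config/captured_regions.bed` as the interval list are assumptions; the GATK arguments (`-ERC GVCF`, `--genomicsdb-workspace-path`, `-L`, `-V gendb://...`) are standard GATK4 options.

```python
def joint_calling_commands(samples, ref_fasta, intervals,
                           bam_dir="results/prepared",
                           out_dir="results/called"):
    """Return argv lists for HaplotypeCaller (per sample),
    GenomicsDBImport and GenotypeGVCFs. Paths are hypothetical."""
    commands = []
    gvcfs = []

    # 1. Per-sample calling in GVCF mode, so samples can be joined later.
    for sample in samples:
        gvcf = f"{out_dir}/{sample}.g.vcf.gz"
        gvcfs.append(gvcf)
        commands.append(["gatk", "HaplotypeCaller", "-R", ref_fasta,
                         "-I", f"{bam_dir}/{sample}.recal.bam",
                         "-L", intervals, "-ERC", "GVCF", "-O", gvcf])

    # 2. Consolidate all GVCFs into one GenomicsDB workspace.
    #    GenomicsDBImport requires intervals (-L).
    workspace = f"{out_dir}/genomicsdb"
    import_cmd = ["gatk", "GenomicsDBImport",
                  "--genomicsdb-workspace-path", workspace, "-L", intervals]
    for gvcf in gvcfs:
        import_cmd += ["-V", gvcf]
    commands.append(import_cmd)

    # 3. Joint genotyping reads the workspace through a gendb:// URI.
    commands.append(["gatk", "GenotypeGVCFs", "-R", ref_fasta,
                     "-V", f"gendb://{workspace}",
                     "-O", f"{out_dir}/all.vcf.gz"])
    return commands


if __name__ == "__main__":
    for argv in joint_calling_commands(["SRR24443168", "SRR24443169"],
                                       "ref/genome.fa",
                                       "config/captured_regions.bed"):
        print(" ".join(argv))
```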
Filter
- Select SNP or INDEL variants from a VCF file (SelectVariants)
- Build a recalibration model to score variant quality for filtering purposes (VariantRecalibrator)
- Apply a score cutoff to filter variants based on a recalibration table (ApplyVQSR)
- Merge all the VCF files (Picard)
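The filter stage runs the same three tools once for SNPs and once for INDELs, then merges the two filtered files. Here is a minimal Python sketch under stated assumptions: the helper names, output file names, resource files, annotation list and the 99.0 sensitivity level are illustrative and not taken from the repository. The `--resource:label,known=...,training=...,truth=...,prior=...` form is the GATK 4.1+ syntax, and `gatk MergeVcfs` is used here as a stand-in for the standalone Picard call, since GATK4 bundles the Picard tools.

```python
def filter_commands(mode, ref_fasta, in_vcf, resources, annotations,
                    out_dir="results/filtered", sensitivity=99.0):
    """Return argv lists for SelectVariants, VariantRecalibrator and
    ApplyVQSR for one variant type. All file names are hypothetical."""
    if mode not in ("SNP", "INDEL"):
        raise ValueError("mode must be 'SNP' or 'INDEL'")
    prefix = f"{out_dir}/{mode.lower()}"
    subset = f"{prefix}.vcf.gz"

    # 1. Keep only one variant type.
    select = ["gatk", "SelectVariants", "-R", ref_fasta, "-V", in_vcf,
              "--select-type-to-include", mode, "-O", subset]

    # 2. Train the recalibration model on labelled resource VCFs.
    recal = ["gatk", "VariantRecalibrator", "-R", ref_fasta,
             "-V", subset, "-mode", mode]
    for spec, path in resources.items():
        recal += [f"--resource:{spec}", path]
    for name in annotations:
        recal += ["-an", name]
    recal += ["-O", f"{prefix}.recal", "--tranches-file", f"{prefix}.tranches"]

    # 3. Apply the score cutoff at the chosen truth sensitivity.
    apply_vqsr = ["gatk", "ApplyVQSR", "-R", ref_fasta, "-V", subset,
                  "--recal-file", f"{prefix}.recal",
                  "--tranches-file", f"{prefix}.tranches",
                  "--truth-sensitivity-filter-level", str(sensitivity),
                  "-mode", mode, "-O", f"{prefix}.vqsr.vcf.gz"]
    return [select, recal, apply_vqsr]


def merge_command(vcfs, out_vcf):
    """Merge the per-type VCFs back into a single file."""
    cmd = ["gatk", "MergeVcfs"]
    for vcf in vcfs:
        cmd += ["-I", vcf]
    return cmd + ["-O", out_vcf]
```

A call such as `filter_commands("SNP", "ref/genome.fa", "results/called/all.vcf.gz", {"hapmap,known=false,training=true,truth=true,prior=15.0": "ref/hapmap.vcf.gz"}, ["QD", "FS", "MQ"])` yields the SNP branch; the INDEL branch differs only in `mode`, resources and annotations.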
Annotation
- Annotate variant calls with VEP (VEP)
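As an illustration of the annotation step, the sketch below assembles a VEP command line that reads a VCF, writes a bgzip-compressed annotated VCF and saves the HTML summary. The helper name, the input path and the cache directory are assumptions. The output names mirror `results/vep_annotated.vcf.gz` and `report/vep_report.html` from the output tree, and the flags used are existing VEP options.

```python
def vep_command(in_vcf, cache_dir,
                out_vcf="results/vep_annotated.vcf.gz",
                stats_html="report/vep_report.html"):
    """Build a VEP command line for offline, cache-based annotation.

    in_vcf and cache_dir are hypothetical; the workflow's real paths
    and any extra annotation flags may differ.
    """
    return ["vep",
            "--input_file", in_vcf,
            "--output_file", out_vcf,
            "--vcf",                        # write VCF instead of VEP's tab format
            "--compress_output", "bgzip",   # matches the .vcf.gz output name
            "--cache", "--offline",         # use a local cache, no database access
            "--dir_cache", cache_dir,
            "--stats_file", stats_html,     # HTML summary report
            "--force_overwrite"]


if __name__ == "__main__":
    print(" ".join(vep_command("results/filtered/all.vqsr.vcf.gz", "ref/vep")))
```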
SnakeMake Report
Outputs
.
├── config
│ ├── captured_regions.bed
│ ├── config.yaml
│ └── samples.tsv
├── dag.svg
├── logs
│ ├── annotate
│ ├── call
│ ├── filter
│ ├── prepare
│ ├── qc
│ ├── ref
│ └── trim
├── raw
│ ├── SRR24443168.fastq.gz
│ └── SRR24443169.fastq.gz
├── README.md
├── report
│ ├── fastp_multiqc_data
│ ├── fastp_multiqc.html
│ ├── prepare_multiqc_data
│ ├── prepare_multiqc.html
│ └── vep_report.html
├── results
│ ├── called
│ ├── filtered
│ ├── prepared
│ ├── trimmed
│ └── vep_annotated.vcf.gz
├── workflow
│ ├── envs
│ ├── report
│ ├── rules
│ ├── schemas
│ ├── scripts
│ └── Snakefile
Directed Acyclic Graph
Reference
GATK best practices workflow: https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows
GATK: https://software.broadinstitute.org/gatk/
VEP: https://www.ensembl.org/info/docs/tools/vep/index.html
fastp: https://github.com/OpenGene/fastp
BWA mem2: http://bio-bwa.sourceforge.net/
samblaster: https://github.com/GregoryFaust/samblaster
BaseRecalibrator: https://gatk.broadinstitute.org/hc/en-us/articles/13832708374939-BaseRecalibrator
ApplyBQSR: https://github.com/GregoryFaust/samblaster
HaplotypeCaller: https://gatk.broadinstitute.org/hc/en-us/articles/13832687299739-HaplotypeCaller
GenomicsDBImport: https://gatk.broadinstitute.org/hc/en-us/articles/13832686645787-GenomicsDBImport
GenotypeGVCFs: https://gatk.broadinstitute.org/hc/en-us/articles/13832766863259-GenotypeGVCFs
SelectVariants: https://gatk.broadinstitute.org/hc/en-us/articles/13832694334235-SelectVariants
VariantRecalibrator: https://gatk.broadinstitute.org/hc/en-us/articles/13832694334235-VariantRecalibrator
ApplyVQSR: https://gatk.broadinstitute.org/hc/en-us/articles/13832694334235-ApplyVQSR
Picard: https://broadinstitute.github.io/picard
MultiQC: https://multiqc.info