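This is the first GATK variant-calling pipeline I have learned: short germline variants in human, covering SNPs and INDELs. I wrote a SnakeMake analysis workflow that runs from the FASTQ files to the final VEP-annotated VCF file. For an introduction to VCF, see the previous post, "Gene Sequence Variation Information: VCF (Variant Call Format)".
The pipeline code is at https://jihulab.com/BioQuest/smkhgs or https://github.com/BioQuestX/smkhgs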
README
GATK best practices workflow
Pipeline summary
SnakeMake workflow for Human Germline short variants (SNP+INDEL)
Reference
- Reference genome related files and GATK bundle files (GATK)
- VEP variation annotation files (VEP)
Prepare
- Adapter trimming (Fastp)
- Aligner (BWA mem2)
- Mark duplicates (samblaster)
- Generates recalibration table for Base Quality Score Recalibration (BaseRecalibrator)
- Apply base quality score recalibration (ApplyBQSR)
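The last two steps of this stage depend on each other: BaseRecalibrator writes a table and ApplyBQSR consumes it. A minimal sketch of the two command lines follows, written as a plain Python helper rather than as the repository's actual Snakemake rules. The function name and all file paths are hypothetical; the GATK arguments are the standard ones for these tools.

```python
from typing import List


def bqsr_commands(bam: str, ref: str, known_sites: List[str],
                  table: str, out_bam: str) -> List[List[str]]:
    """Return the two GATK command lines for base quality score recalibration."""
    # Step 1: model systematic base-quality errors, masking known variant sites.
    recal = ["gatk", "BaseRecalibrator", "-I", bam, "-R", ref]
    for vcf in known_sites:
        recal += ["--known-sites", vcf]
    recal += ["-O", table]

    # Step 2: write a new BAM whose qualities are adjusted using that table.
    apply = ["gatk", "ApplyBQSR", "-I", bam, "-R", ref,
             "--bqsr-recal-file", table, "-O", out_bam]
    return [recal, apply]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in bqsr_commands("sample.dedup.bam", "genome.fasta",
                             ["dbsnp.vcf.gz"], "sample.recal.table",
                             "sample.recal.bam"):
        print(" ".join(cmd))
```

The known-sites files would normally come from the GATK bundle listed under Reference.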
Quality control report
- Fastp report (MultiQC)
- Alignment report (MultiQC)
Call
- Call germline SNPs and indels via local re-assembly of haplotypes (HaplotypeCaller)
- Import VCFs to GenomicsDB (GenomicsDBImport)
- Perform joint genotyping on one or more samples pre-called with HaplotypeCaller (GenotypeGVCFs)
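The three calling steps above chain together as per-sample GVCFs, then one GenomicsDB workspace, then one joint-genotyped VCF. The sketch below only builds the command lines and is not taken from the repository's rules. The function name, the `.g.vcf.gz` naming and the example paths are assumptions; using `config/captured_regions.bed` as the interval list is also an assumption based on the file shown in the output tree.

```python
from typing import Dict, List


def joint_calling_commands(bams: Dict[str, str], ref: str, intervals: str,
                           workspace: str, out_vcf: str) -> List[List[str]]:
    """Build HaplotypeCaller, GenomicsDBImport and GenotypeGVCFs commands."""
    cmds: List[List[str]] = []
    gvcfs: List[str] = []

    # One HaplotypeCaller run per sample, emitting a GVCF (-ERC GVCF).
    for sample, bam in sorted(bams.items()):
        gvcf = f"{sample}.g.vcf.gz"  # hypothetical naming scheme
        cmds.append(["gatk", "HaplotypeCaller", "-R", ref, "-I", bam,
                     "-L", intervals, "-ERC", "GVCF", "-O", gvcf])
        gvcfs.append(gvcf)

    # Consolidate all GVCFs into a single GenomicsDB workspace.
    db_import = ["gatk", "GenomicsDBImport",
                 "--genomicsdb-workspace-path", workspace, "-L", intervals]
    for gvcf in gvcfs:
        db_import += ["-V", gvcf]
    cmds.append(db_import)

    # Joint genotyping reads the workspace through the gendb:// prefix.
    cmds.append(["gatk", "GenotypeGVCFs", "-R", ref,
                 "-V", f"gendb://{workspace}", "-O", out_vcf])
    return cmds


if __name__ == "__main__":
    example = {"SRR24443168": "SRR24443168.bam", "SRR24443169": "SRR24443169.bam"}
    for cmd in joint_calling_commands(example, "genome.fasta",
                                      "config/captured_regions.bed",
                                      "genomicsdb", "all.vcf.gz"):
        print(" ".join(cmd))
```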
Filter
- Select SNP or INDEL variants from a VCF file (SelectVariants)
- Build a recalibration model to score variant quality for filtering purposes (VariantRecalibrator)
- Apply a score cutoff to filter variants based on a recalibration table (ApplyVQSR)
- Merge all the VCF files (Picard)
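Filtering is done once per variant type: select the type, fit a recalibration model, then apply a sensitivity cutoff. The sketch below builds those three command lines for one mode. It is an illustration under assumptions, not the repository's code: the function name, the output file names, the default cutoff of 99.0 and the example resource and annotation values are all hypothetical and would normally come from the pipeline configuration.

```python
from typing import Dict, List


def vqsr_commands(vcf: str, ref: str, mode: str,
                  resources: Dict[str, str], annotations: List[str],
                  sensitivity: float = 99.0) -> List[List[str]]:
    """Commands to select, model and filter one variant type (SNP or INDEL)."""
    if mode not in ("SNP", "INDEL"):
        raise ValueError("mode must be 'SNP' or 'INDEL'")
    tag = mode.lower()
    selected = f"{tag}.vcf.gz"
    recal, tranches = f"{tag}.recal", f"{tag}.tranches"

    # Keep only variants of the requested type.
    select = ["gatk", "SelectVariants", "-R", ref, "-V", vcf,
              "--select-type-to-include", mode, "-O", selected]

    # Fit the recalibration model from training resources and annotations.
    model = ["gatk", "VariantRecalibrator", "-R", ref, "-V", selected,
             "-mode", mode]
    for spec, path in resources.items():
        model += [f"--resource:{spec}", path]
    for name in annotations:
        model += ["-an", name]
    model += ["-O", recal, "--tranches-file", tranches]

    # Apply the cutoff so that variants below it are marked as filtered.
    apply = ["gatk", "ApplyVQSR", "-R", ref, "-V", selected, "-mode", mode,
             "--recal-file", recal, "--tranches-file", tranches,
             "--truth-sensitivity-filter-level", str(sensitivity),
             "-O", f"{tag}.filtered.vcf.gz"]
    return [select, model, apply]


if __name__ == "__main__":
    # Illustrative resource and annotation choices only.
    res = {"hapmap,known=false,training=true,truth=true,prior=15.0": "hapmap.vcf.gz"}
    for cmd in vqsr_commands("all.vcf.gz", "genome.fasta", "SNP", res, ["QD", "FS"]):
        print(" ".join(cmd))
```

Running the helper once with `SNP` and once with `INDEL` yields the two filtered files that the final Picard step merges.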
Annotation
- Annotate variant calls with VEP (VEP)
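As a rough illustration of the annotation step, the helper below assembles one possible VEP command that writes a bgzipped VCF plus an HTML summary. The function name, the offline cache setup and the GRCh38 assembly are assumptions and may differ from the repository's settings; the default output paths match `results/vep_annotated.vcf.gz` and `report/vep_report.html` in the output tree.

```python
from typing import List


def vep_command(in_vcf: str, cache_dir: str,
                out_vcf: str = "results/vep_annotated.vcf.gz",
                stats: str = "report/vep_report.html",
                assembly: str = "GRCh38") -> List[str]:
    """Build a VEP command line that annotates a VCF using a local cache."""
    return [
        "vep", "-i", in_vcf, "-o", out_vcf,
        "--vcf", "--compress_output", "bgzip",    # keep VCF format, bgzip the output
        "--cache", "--offline", "--dir_cache", cache_dir,
        "--assembly", assembly,                   # assumed genome build
        "--stats_file", stats,                    # HTML summary report
        "--force_overwrite",
    ]


if __name__ == "__main__":
    # Hypothetical input and cache paths.
    print(" ".join(vep_command("results/filtered/all.vcf.gz", "resources/vep_cache")))
```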
SnakeMake Report
Outputs
.
├── config
│   ├── captured_regions.bed
│   ├── config.yaml
│   └── samples.tsv
├── dag.svg
├── logs
│   ├── annotate
│   ├── call
│   ├── filter
│   ├── prepare
│   ├── qc
│   ├── ref
│   └── trim
├── raw
│   ├── SRR24443168.fastq.gz
│   └── SRR24443169.fastq.gz
├── README.md
├── report
│   ├── fastp_multiqc_data
│   ├── fastp_multiqc.html
│   ├── prepare_multiqc_data
│   ├── prepare_multiqc.html
│   └── vep_report.html
├── results
│   ├── called
│   ├── filtered
│   ├── prepared
│   ├── trimmed
│   └── vep_annotated.vcf.gz
├── workflow
│   ├── envs
│   ├── report
│   ├── rules
│   ├── schemas
│   ├── scripts
│   └── Snakefile
Directed Acyclic Graph
Reference
- GATK best practices workflow: https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows
- GATK: https://software.broadinstitute.org/gatk/
- VEP: https://www.ensembl.org/info/docs/tools/vep/index.html
- fastp: https://github.com/OpenGene/fastp
- BWA mem2: http://bio-bwa.sourceforge.net/
- samblaster: https://github.com/GregoryFaust/samblaster
- BaseRecalibrator: https://gatk.broadinstitute.org/hc/en-us/articles/13832708374939-BaseRecalibrator
- ApplyBQSR: https://github.com/GregoryFaust/samblaster
- HaplotypeCaller: https://gatk.broadinstitute.org/hc/en-us/articles/13832687299739-HaplotypeCaller
- GenomicsDBImport: https://gatk.broadinstitute.org/hc/en-us/articles/13832686645787-GenomicsDBImport
- GenotypeGVCFs: https://gatk.broadinstitute.org/hc/en-us/articles/13832766863259-GenotypeGVCFs
- SelectVariants: https://gatk.broadinstitute.org/hc/en-us/articles/13832694334235-SelectVariants
- VariantRecalibrator: https://gatk.broadinstitute.org/hc/en-us/articles/13832694334235-VariantRecalibrator
- ApplyVQSR: https://gatk.broadinstitute.org/hc/en-us/articles/13832694334235-ApplyVQSR
- Picard: https://broadinstitute.github.io/picard
- MultiQC: https://multiqc.info