About Python: 01 GATK Human Germline Variant Best Practices: Introduction to the SnakeMake Workflow

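This is the first GATK variant-calling workflow I have learned: short variant discovery (SNPs and INDELs) in the human germline. I wrote a SnakeMake analysis workflow that runs from FASTQ files to the final VEP-annotated VCF file. For an introduction to VCF, see the previous post, "Gene Sequence Variant Information: VCF (Variant Call Format)".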

The workflow code is available at https://jihulab.com/BioQuest/smkhgs or https://github.com/BioQuestX/smkhgs

README

GATK best practices workflow Pipeline summary

SnakeMake workflow for Human Germline short variants (SNP+INDEL)

Reference

  1. Reference genome related files and GATK bundle files (GATK)
  2. VEP variation annotation files (VEP)

Prepare

  1. Adapter trimming (Fastp)
  2. Align reads to the reference genome (BWA mem2)
  3. Mark duplicates (samblaster)
  4. Generate a recalibration table for Base Quality Score Recalibration (BaseRecalibrator)
  5. Apply base quality score recalibration (ApplyBQSR)
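
The Prepare steps above map roughly onto the command lines sketched below. This is a minimal illustration, not the repository's actual rules: the function name `prepare_commands`, the paired-end `_1`/`_2` file naming, every output path, and the known-sites file are assumptions made for the example. Alignment and duplicate marking are shown as one pipe because samblaster reads SAM from standard input.

```python
import shlex


def prepare_commands(sample, ref="ref/genome.fa",
                     known_sites="ref/known_sites.vcf.gz", threads=4):
    """Build shell commands for the Prepare stage of one paired-end sample.

    Every path here is a hypothetical placeholder, not the repository layout.
    """
    q = shlex.quote
    raw1, raw2 = f"raw/{sample}_1.fastq.gz", f"raw/{sample}_2.fastq.gz"
    trim1 = f"results/trimmed/{sample}_1.fastq.gz"
    trim2 = f"results/trimmed/{sample}_2.fastq.gz"
    bam = f"results/prepared/{sample}.bam"
    table = f"results/prepared/{sample}.recal.table"
    recal_bam = f"results/prepared/{sample}.recal.bam"
    # GATK requires a read group; the aligner expands the literal "\t" itself.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        # 1. Adapter trimming
        f"fastp -i {q(raw1)} -I {q(raw2)} -o {q(trim1)} -O {q(trim2)} "
        f"-w {threads}",
        # 2 + 3. Align, mark duplicates in the stream, then coordinate-sort
        f"bwa-mem2 mem -t {threads} -R {q(read_group)} {q(ref)} "
        f"{q(trim1)} {q(trim2)} | samblaster | samtools sort -o {q(bam)} -",
        # 4. Build the recalibration table
        f"gatk BaseRecalibrator -R {q(ref)} -I {q(bam)} "
        f"--known-sites {q(known_sites)} -O {q(table)}",
        # 5. Apply the recalibration
        f"gatk ApplyBQSR -R {q(ref)} -I {q(bam)} "
        f"--bqsr-recal-file {q(table)} -O {q(recal_bam)}",
    ]


for command in prepare_commands("SRR24443168"):
    print(command)
```

The strings contain a pipe, so they are meant to be run through a shell (for example inside a Snakemake `shell:` directive) rather than passed to `subprocess.run` as an argument list.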

Quality control report

  1. Fastp report (MultiQC)
  2. Alignment report (MultiQC)

Call

  1. Call germline SNPs and indels via local re-assembly of haplotypes (HaplotypeCaller)
  2. Import VCFs to GenomicsDB (GenomicsDBImport)
  3. Perform joint genotyping on one or more samples pre-called with HaplotypeCaller (GenotypeGVCFs)
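
As a rough sketch of how the three Call steps fit together: each sample is called in GVCF mode, the per-sample GVCFs are imported into one GenomicsDB workspace, and joint genotyping reads that workspace through a `gendb://` path. The function name `call_commands`, the file paths, and the choice to restrict calling to `config/captured_regions.bed` are assumptions for illustration; the repository's rules may differ.

```python
def call_commands(samples, ref="ref/genome.fa",
                  intervals="config/captured_regions.bed",
                  workspace="results/called/genomicsdb"):
    """Return argument lists for the Call stage (hypothetical paths)."""
    commands = []
    gvcfs = []
    for sample in samples:
        gvcf = f"results/called/{sample}.g.vcf.gz"
        gvcfs.append(gvcf)
        # -ERC GVCF emits reference confidence records, which joint
        # genotyping needs in order to compare samples site by site.
        commands.append([
            "gatk", "HaplotypeCaller", "-R", ref,
            "-I", f"results/prepared/{sample}.recal.bam",
            "-L", intervals, "-ERC", "GVCF", "-O", gvcf,
        ])
    # GenomicsDBImport needs at least one interval and one -V per GVCF.
    importer = ["gatk", "GenomicsDBImport",
                "--genomicsdb-workspace-path", workspace, "-L", intervals]
    for gvcf in gvcfs:
        importer += ["-V", gvcf]
    commands.append(importer)
    commands.append([
        "gatk", "GenotypeGVCFs", "-R", ref,
        "-V", f"gendb://{workspace}", "-O", "results/called/all.vcf.gz",
    ])
    return commands


for argv in call_commands(["SRR24443168", "SRR24443169"]):
    print(" ".join(argv))
```

Because each command is an argument list, it can be handed to `subprocess.run(argv, check=True)` without shell quoting.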

Filter

  1. Select the SNP or INDEL subset of variants from a VCF file (SelectVariants)
  2. Build a recalibration model to score variant quality for filtering purposes (VariantRecalibrator)
  3. Apply a score cutoff to filter variants based on a recalibration table (ApplyVQSR)
  4. Merge all the VCF files (Picard)
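
The Filter stage runs the same three tools once for SNPs and once for INDELs, then merges the two results. The sketch below shows that shape under stated assumptions: `filter_commands`, `merge_command`, all paths, the annotation list, the 99.0 sensitivity level, and the resource files with their priors are illustrative placeholders, not values taken from the repository. The merge uses the Picard `MergeVcfs` tool as bundled with the `gatk` launcher.

```python
def filter_commands(mode, resources, ref="ref/genome.fa",
                    vcf="results/called/all.vcf.gz",
                    annotations=("QD", "FS", "SOR", "MQRankSum",
                                 "ReadPosRankSum")):
    """Return SelectVariants, VariantRecalibrator and ApplyVQSR commands."""
    if mode not in ("SNP", "INDEL"):
        raise ValueError(f"mode must be SNP or INDEL, got {mode!r}")
    prefix = f"results/filtered/{mode.lower()}"
    selected = f"{prefix}.vcf.gz"
    recal, tranches = f"{prefix}.recal", f"{prefix}.tranches"
    select = ["gatk", "SelectVariants", "-R", ref, "-V", vcf,
              "--select-type-to-include", mode, "-O", selected]
    recalibrate = ["gatk", "VariantRecalibrator", "-R", ref, "-V", selected,
                   "-mode", mode, "-O", recal, "--tranches-file", tranches]
    # Each resource is passed as --resource:<name,tags> <file>.
    for tags, path in resources.items():
        recalibrate += [f"--resource:{tags}", path]
    for name in annotations:
        recalibrate += ["-an", name]
    apply = ["gatk", "ApplyVQSR", "-R", ref, "-V", selected, "-mode", mode,
             "--recal-file", recal, "--tranches-file", tranches,
             "--truth-sensitivity-filter-level", "99.0",
             "-O", f"{prefix}.vqsr.vcf.gz"]
    return [select, recalibrate, apply]


def merge_command(inputs, output="results/filtered/all.vqsr.vcf.gz"):
    """Return a MergeVcfs command joining the per-type VCFs."""
    argv = ["gatk", "MergeVcfs"]
    for path in inputs:
        argv += ["-I", path]
    return argv + ["-O", output]


# Placeholder training resources; real runs use the GATK bundle files.
SNP_RESOURCES = {
    "hapmap,known=false,training=true,truth=true,prior=15.0": "ref/hapmap.vcf.gz",
    "dbsnp,known=true,training=false,truth=false,prior=2.0": "ref/dbsnp.vcf.gz",
}

for argv in filter_commands("SNP", SNP_RESOURCES):
    print(" ".join(argv))
print(" ".join(merge_command(["results/filtered/snp.vqsr.vcf.gz",
                              "results/filtered/indel.vqsr.vcf.gz"])))
```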

Annotation

Annotate variant calls with VEP (VEP)
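
When VEP writes VCF output, it stores its annotations in the `CSQ` key of the INFO column and lists the subfield names in the matching `##INFO` header line after the text `Format: `. The sketch below shows one way to unpack that field when reading the annotated file. The helper names `csq_fields` and `parse_csq` are hypothetical, and the header and record strings are made-up examples, not data from this workflow.

```python
def csq_fields(header_line):
    """Extract CSQ subfield names from the ##INFO=<ID=CSQ,...> header line."""
    layout = header_line.split("Format: ", 1)[1]
    return layout.rstrip('">\n').split("|")


def parse_csq(info, fields):
    """Return one dict per annotation block found in a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            # Blocks are comma-separated; subfields within a block use "|".
            return [dict(zip(fields, block.split("|")))
                    for block in entry[4:].split(",")]
    return []


# Made-up header and INFO values for illustration only.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: '
          'Allele|Consequence|IMPACT|SYMBOL">')
info = ("DP=30;CSQ=T|missense_variant|MODERATE|GENE1,"
        "T|upstream_gene_variant|MODIFIER|GENE2")

fields = csq_fields(header)
for record in parse_csq(info, fields):
    print(record["SYMBOL"], record["Consequence"], record["IMPACT"])
```

Reading the field names from the header instead of hard-coding them keeps the parser valid when VEP is run with options that add or reorder subfields.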

SnakeMake Report

Outputs

.
├── config
│   ├── captured_regions.bed
│   ├── config.yaml
│   └── samples.tsv
├── dag.svg
├── logs
│   ├── annotate
│   ├── call
│   ├── filter
│   ├── prepare
│   ├── qc
│   ├── ref
│   └── trim
├── raw
│   ├── SRR24443168.fastq.gz
│   └── SRR24443169.fastq.gz
├── README.md
├── report
│   ├── fastp_multiqc_data
│   ├── fastp_multiqc.html
│   ├── prepare_multiqc_data
│   ├── prepare_multiqc.html
│   └── vep_report.html
├── results
│   ├── called
│   ├── filtered
│   ├── prepared
│   ├── trimmed
│   └── vep_annotated.vcf.gz
├── workflow
│   ├── envs
│   ├── report
│   ├── rules
│   ├── schemas
│   ├── scripts
│   └── Snakefile
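
Given this layout, a quick pre-flight check can confirm that every sample listed in `config/samples.tsv` has a FASTQ file under `raw/` before the workflow starts. The sketch below is a guess at how that could look: the sample sheet's column name (`sample`) is an assumption, since the sheet's real columns are not shown here, and `read_samples` and `missing_inputs` are hypothetical helpers. The one-file-per-sample naming follows the `raw/` listing above.

```python
import csv
import io
from pathlib import Path


def read_samples(handle, column="sample"):
    """Return sample names from a tab-separated sheet with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row[column] for row in reader if row.get(column)]


def missing_inputs(samples, root="."):
    """List expected raw FASTQ files that are not present under root."""
    expected = [Path(root) / "raw" / f"{name}.fastq.gz" for name in samples]
    return [path for path in expected if not path.exists()]


# An in-memory stand-in for config/samples.tsv (assumed column name).
sheet = io.StringIO("sample\nSRR24443168\nSRR24443169\n")
samples = read_samples(sheet)
print(samples)
print(missing_inputs(samples))
```

With a real project, `read_samples(open("config/samples.tsv"))` would replace the in-memory sheet.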

Directed Acyclic Graph
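
Snakemake can print the job graph in Graphviz format, so a file such as `dag.svg` is usually produced with `snakemake --dag | dot -Tsvg > dag.svg`. To show what the graph encodes, here is a small sketch that writes the stages from this README as a dependency map and derives one valid execution order with the standard library. The rule names and the exact edges are assumptions inferred from the step lists, not the names used in the repository.

```python
from graphlib import TopologicalSorter

# Each key depends on the rules in its set (hypothetical rule names).
DEPENDS_ON = {
    "fastp": set(),
    "multiqc_fastp": {"fastp"},
    "bwa_mem2_samblaster": {"fastp"},
    "base_recalibrator": {"bwa_mem2_samblaster"},
    "apply_bqsr": {"bwa_mem2_samblaster", "base_recalibrator"},
    "multiqc_prepare": {"apply_bqsr"},
    "haplotype_caller": {"apply_bqsr"},
    "genomicsdb_import": {"haplotype_caller"},
    "genotype_gvcfs": {"genomicsdb_import"},
    "select_snp": {"genotype_gvcfs"},
    "select_indel": {"genotype_gvcfs"},
    "vqsr_snp": {"select_snp"},
    "vqsr_indel": {"select_indel"},
    "merge_vcfs": {"vqsr_snp", "vqsr_indel"},
    "vep": {"merge_vcfs"},
}


def execution_order(graph):
    """Return one order in which every rule follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(DEPENDS_ON)
print(" -> ".join(order))
```

The SNP and INDEL branches have no edges between them, which is why a scheduler is free to run them in parallel once joint genotyping has finished.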

Reference

GATK best practices workflow: https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows
GATK: https://software.broadinstitute.org/gatk/
VEP: https://www.ensembl.org/info/docs/tools/vep/index.html
fastp: https://github.com/OpenGene/fastp
BWA mem2: http://bio-bwa.sourceforge.net/
samblaster: https://github.com/GregoryFaust/samblaster
BaseRecalibrator: https://gatk.broadinstitute.org/hc/en-us/articles/13832708374939-BaseRecalibrator
ApplyBQSR: https://github.com/GregoryFaust/samblaster
HaplotypeCaller: https://gatk.broadinstitute.org/hc/en-us/articles/13832687299739-HaplotypeCaller
GenomicsDBImport: https://gatk.broadinstitute.org/hc/en-us/articles/13832686645787-GenomicsDBImport
GenotypeGVCFs: https://gatk.broadinstitute.org/hc/en-us/articles/13832766863259-GenotypeGVCFs
SelectVariants: https://gatk.broadinstitute.org/hc/en-us/articles/13832694334235-SelectVariants
VariantRecalibrator: https://gatk.broadinstitute.org/hc/en-us/articles/13832694334235-VariantRecalibrator
ApplyVQSR: https://gatk.broadinstitute.org/hc/en-us/articles/13832694334235-ApplyVQSR
Picard: https://broadinstitute.github.io/picard
MultiQC: https://multiqc.info


This article was compiled and published by 乐趣区. Please credit the source when reposting.
