Genome annotation pipeline and tools

Notes on eukaryotic genome annotation.

(Title image from yourgenome.org)

Pipeline to assemble and annotate a eukaryotic genome.

Based on "A beginner’s guide to eukaryotic genome annotation" (Yandell and Ence, 2012), which is a very good resource.

1. Genome Assembly

Statistics for judging assembly quality: N50, average gap size per scaffold, and average number of gaps per scaffold.

Tools: CEGMA (estimates assembly completeness from a set of core eukaryotic genes), etc.
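The assembly statistics above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: it assumes scaffolds are already loaded as plain strings and that a gap is any run of N characters. The function names `n50` and `gap_stats` and the toy sequences are made up for the example.

```python
import re

def n50(lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly

def gap_stats(scaffolds):
    """Return (average gap size, average number of gaps per scaffold).
    Assumption: a gap is a run of one or more N/n characters."""
    gaps = [len(m.group()) for seq in scaffolds
            for m in re.finditer(r"[Nn]+", seq)]
    avg_gap_size = sum(gaps) / len(gaps) if gaps else 0.0
    gaps_per_scaffold = len(gaps) / len(scaffolds) if scaffolds else 0.0
    return avg_gap_size, gaps_per_scaffold

# Toy scaffolds (hypothetical data, not from a real assembly)
scaffolds = ["ACGTNNNNACGT", "ACGTACNNGTNNNNAC", "ACGT"]
print(n50([len(s) for s in scaffolds]))  # 16
print(gap_stats(scaffolds))
```

A higher N50 together with fewer and smaller gaps generally indicates a more contiguous assembly, but none of these numbers says anything about completeness, which is what CEGMA-style checks address.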

2. Genome Annotation

2.1 Identify repeats and mask the genome

Repeats include ‘low-complexity’ sequences and transposable (mobile) elements.

Tools: RepeatMasker
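A quick sanity check after masking is to see how much of the genome was masked. The sketch below assumes a soft-masked sequence, where repeats are written in lowercase (RepeatMasker produces this with its `-xsmall` option) and the sequence is already loaded as a string. The function name `soft_masked_fraction` is made up for the example.

```python
def soft_masked_fraction(sequence):
    """Fraction of bases written in lowercase, i.e. soft-masked as repeats.
    Assumption: unmasked bases are uppercase and masked bases are lowercase."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return masked / len(sequence)

# Toy sequence (hypothetical): 8 of 16 bases are soft-masked
print(soft_masked_fraction("ACGTacgtacgtACGT"))  # 0.5
```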

2.2 Align evidence to the genome

Evidence includes ESTs, RNA-seq data, etc.

Aligners: STAR, HISAT2, TopHat, BWA, etc. Cufflinks is not an aligner; it assembles the aligned RNA-seq reads into transcripts.

2.3 Annotate protein-coding genes

2.3.1 Structural annotation

Gene predictors can be trained in a supervised or unsupervised way, and prediction can be ab initio or evidence-driven.

Tools: Augustus, BRAKER1, etc.

Assessing annotation quality:

  • percentage of annotations that encode proteins with known domains
  • Annotation edit distance (AED) metric (from MAKER2)
  • in-frame stop codons
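Two of the checks above are easy to sketch. AED is computed here at the base level as 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the annotation and SP is the fraction of annotated bases supported by evidence, so 0 means perfect agreement and 1 means no support. This is a simplified illustration, not MAKER2's actual code: exons are assumed to be 1-based inclusive (start, end) tuples on one strand, the standard genetic code is assumed for stop codons, and all function names are made up for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

def covered_bases(intervals):
    """Set of positions covered by 1-based inclusive (start, end) intervals."""
    return {pos for start, end in intervals for pos in range(start, end + 1)}

def annotation_edit_distance(annotation_exons, evidence_exons):
    """Base-level AED = 1 - (SN + SP) / 2."""
    annotated = covered_bases(annotation_exons)
    evidence = covered_bases(evidence_exons)
    if not annotated or not evidence:
        return 1.0  # nothing to compare: treat as unsupported
    shared = len(annotated & evidence)
    sensitivity = shared / len(evidence)    # evidence recovered by annotation
    specificity = shared / len(annotated)   # annotation supported by evidence
    return 1.0 - (sensitivity + specificity) / 2.0

def has_in_frame_stop(cds):
    """True if a stop codon occurs in frame before the final codon."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])

# Toy gene model (hypothetical coordinates): the second exon is only
# half supported by the evidence alignment.
print(annotation_edit_distance([(100, 199), (300, 399)],
                               [(100, 199), (300, 349)]))  # 0.125
print(has_in_frame_stop("ATGAAATAA"))     # False
print(has_in_frame_stop("ATGTAGAAATAA"))  # True
```

Building sets of positions is wasteful for real genes; interval arithmetic would be the natural replacement, but the set version keeps the definition easy to read.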

2.3.2 Functional annotation

  • BLAST protein sequences: NCBI, UniProt
  • Protein domains: CDD, InterProScan
  • GO terms and Pathway mapping

2.4 Annotate non-coding RNAs

  • tRNA: tRNAscan-SE
  • small nucleolar RNAs (snoRNAs): Snoscan
  • MicroRNA: miRDB, miRDeep2
  • Pipelines: Ensembl

2.5 Annotation visualisation

  • Genome browsers: Artemis, Apollo, JBrowse, GBrowse

3. Curation

CC-BY-NC 4.0