Genome annotation pipeline and tools

Notes on eukaryotic genome annotation.

(Title image from yourgenome.org)

Pipeline to assemble and annotate a eukaryotic genome.

Based on "A beginner’s guide to eukaryotic genome annotation", which is a very good resource.

1. Genome Assembly

Statistics for judging assembly quality: N50, average gap size per scaffold, and average number of gaps per scaffold.
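The statistics above can be computed directly from the scaffold sequences. This is a minimal sketch, assuming scaffolds are held in memory as plain strings and that gaps are represented as runs of N; the function names n50 and gap_stats are illustrative and do not come from any particular tool.

```python
import re


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def gap_stats(scaffolds):
    """Return (average gap size, average number of gaps per scaffold).

    Assumption: a gap is any run of N (or n) inside a scaffold sequence.
    """
    scaffolds = list(scaffolds)
    gap_lengths = []
    for seq in scaffolds:
        gap_lengths.extend(len(m.group()) for m in re.finditer(r"[Nn]+", seq))
    avg_gap_size = sum(gap_lengths) / len(gap_lengths) if gap_lengths else 0.0
    gaps_per_scaffold = len(gap_lengths) / len(scaffolds) if scaffolds else 0.0
    return avg_gap_size, gaps_per_scaffold


if __name__ == "__main__":
    toy_scaffolds = ["ACGTNNNNACGT", "ACNNGTNNNNAC"]  # made-up sequences
    print(n50(len(s) for s in toy_scaffolds))
    print(gap_stats(toy_scaffolds))
```

A larger N50 with fewer and smaller gaps generally indicates a more contiguous assembly, but these numbers say nothing about correctness or gene content, which is why a completeness check is usually run alongside them.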

Tools: CEGMA (estimates assembly completeness from a set of conserved core eukaryotic genes), etc.

2. Genome Annotation

2.1 Identify repeats and mask the genome

Repeats include ‘low-complexity’ sequences and transposable (mobile) elements.

Tools: RepeatMasker
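Masking comes in two flavours: soft-masking (repeats written in lowercase) and hard-masking (repeats replaced by N). The sketch below assumes the input is a soft-masked sequence held as a Python string; it reports how much of the sequence is masked and converts soft-masking to hard-masking. The function names are illustrative, not part of RepeatMasker.

```python
def masked_fraction(seq):
    """Fraction of bases that are masked, counting lowercase (soft-masked)
    bases and N (hard-masked) bases."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower() or base == "N")
    return masked / len(seq)


def hard_mask(seq):
    """Convert a soft-masked sequence to a hard-masked one by replacing
    every lowercase base with N."""
    return "".join("N" if base.islower() else base for base in seq)


if __name__ == "__main__":
    toy = "ACGTacgtacgtACGT"  # made-up sequence, lowercase = repeat
    print(masked_fraction(toy))
    print(hard_mask(toy))
```

Soft-masking is often preferred as input to gene predictors because the underlying sequence is still available, while hard-masking discards it.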

2.2 Align evidence to the genome

Evidence includes ESTs, RNA-seq data, etc.

Aligners: STAR, HISAT2, TopHat, BWA, etc. Cufflinks is not an aligner; it assembles transcripts from the resulting RNA-seq alignments.

2.3 Annotate protein-coding genes

2.3.1 Structural annotation

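Gene predictors can be trained in a supervised or unsupervised way, and genes can be predicted either ab initio (from the genome sequence alone) or in an evidence-driven way (guided by the alignments from step 2.2).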

Tools: Augustus, BRAKER1

Assessing annotation quality:

  • percentage of annotations that encode proteins with known domains
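  • Annotation edit distance (AED) metric (from MAKER2): 0 means the annotation agrees perfectly with the aligned evidence, 1 means it has no supporting evidence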
  • in-frame stop codons
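Two of the checks above are easy to sketch. AED is defined as 1 - (SN + SP) / 2, where SN is the fraction of evidence bases overlapped by the annotation and SP is the fraction of annotation bases overlapped by the evidence. The code below is a simplified illustration, not MAKER2's implementation: it assumes exons are given as half-open (start, end) coordinate pairs on one strand, and the stop-codon check assumes the standard genetic code and a CDS whose length is a multiple of three.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def covered_positions(exons):
    """Expand half-open (start, end) exon intervals into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end))
    return positions


def aed(annotation_exons, evidence_exons):
    """Annotation edit distance between an annotation and its evidence."""
    annotated = covered_positions(annotation_exons)
    evidence = covered_positions(evidence_exons)
    if not annotated or not evidence:
        return 1.0  # no overlap is possible, so no support
    overlap = len(annotated & evidence)
    sensitivity = overlap / len(evidence)
    specificity = overlap / len(annotated)
    return 1.0 - (sensitivity + specificity) / 2


def has_inframe_stop(cds):
    """True if a stop codon occurs in frame before the final codon."""
    cds = cds.upper()
    internal_codons = (cds[i:i + 3] for i in range(0, len(cds) - 3, 3))
    return any(codon in STOP_CODONS for codon in internal_codons)


if __name__ == "__main__":
    print(aed([(0, 100)], [(50, 150)]))      # partial overlap
    print(has_inframe_stop("ATGAAATAA"))     # only the terminal stop
    print(has_inframe_stop("ATGTAAAAATAA"))  # premature stop in frame
```

Expanding intervals into position sets keeps the sketch short; for whole-genome use, interval arithmetic would be far more memory-efficient.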

2.3.2 Functional annotation

  • BLAST protein sequences: NCBI, UniProt
  • Protein domains: CDD, InterProScan
  • GO terms and Pathway mapping
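A common first pass at functional annotation is to transfer the description of the best BLAST hit to each predicted protein. The sketch below assumes BLAST was run with tabular output (-outfmt 6, default 12 columns) and that the results are available as text; the function name best_hits, the E-value cutoff, and the identifiers in the example are illustrative assumptions.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query id: subject id} keeping the highest-bitscore hit per
    query among hits with E-value <= max_evalue."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore)
    return {query: subject for query, (subject, _) in best.items()}


if __name__ == "__main__":
    # Made-up hits for two hypothetical predicted proteins.
    example = (
        "gene1\tprotA\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t350\n"
        "gene1\tprotB\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-80\t400\n"
        "gene2\tprotC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25\n"
    )
    print(best_hits(example))
```

Best-hit transfer is only a starting point: weak or partial hits can propagate wrong names, so domain evidence and GO terms are usually considered alongside it.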

2.4 Annotate non-coding RNAs

  • tRNA: tRNAscan-SE
  • small nucleolar RNAs (snoRNAs): Snoscan
  • MicroRNA: miRDeep2 (discovery from small-RNA sequencing data), miRDB (target prediction database)
  • Pipelines: Ensembl

2.5 Annotation visualisation

  • Genome browsers and annotation editors: Artemis, Apollo, JBrowse, GBrowse

3. Curation

