Zhigang Lu

Keep it simple.

Genome annotation pipeline and tools

Notes on eukaryotic genome annotation.

(Title image from yourgenome.org)

Pipeline to assemble and annotate a eukaryotic genome.

Based on “A beginner’s guide to eukaryotic genome annotation”, which is a very good resource.

1. Genome Assembly

Statistics for judging assembly quality: N50, average gap size per scaffold, and average number of gaps per scaffold

Tools: CEGMA (estimates assembly completeness from the presence of core eukaryotic genes)
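
The assembly statistics listed above can be computed directly from the scaffold sequences. The following is a minimal sketch, assuming scaffolds are already loaded as plain strings and that gaps are represented as runs of N; the function names `n50` and `gap_stats` are made up for illustration and do not come from any assembly tool.

```python
import re


def n50(lengths):
    """Largest length L such that sequences of length >= L cover at least half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def gap_stats(scaffolds):
    """Return (average gap size, average number of gaps per scaffold).

    Assumption: a gap is any run of N (or n) inside a scaffold.
    """
    gaps = [len(m.group()) for seq in scaffolds for m in re.finditer(r"[Nn]+", seq)]
    avg_gap_size = sum(gaps) / len(gaps) if gaps else 0.0
    avg_gaps_per_scaffold = len(gaps) / len(scaffolds) if scaffolds else 0.0
    return avg_gap_size, avg_gaps_per_scaffold


# Toy scaffolds, invented for this example.
scaffolds = ["ACGT" * 5 + "N" * 10 + "ACGT" * 5, "ACGTACGT", "ACGTNNNNACGTNNACGT"]
print(n50([len(s) for s in scaffolds]))
print(gap_stats(scaffolds))
```

A higher N50 together with fewer and smaller gaps generally points to a more contiguous assembly, but these numbers say nothing about completeness, which is what a tool like CEGMA addresses.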

2. Genome Annotation

2.1 Identify repeats and mask the genome

Repeats include ‘low-complexity’ sequences and transposable (mobile) elements.

Tools: RepeatMasker
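
After masking, it is useful to check how much of the genome was flagged as repetitive. The sketch below assumes the genome was soft-masked, meaning repeats are written in lowercase while the rest stays uppercase; the helper names are hypothetical and are not part of RepeatMasker.

```python
def soft_masked_fraction(seq):
    """Fraction of bases written in lowercase (assumed to be soft-masked repeats)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


def hard_mask(seq):
    """Convert a soft-masked sequence to a hard-masked one by replacing lowercase bases with N."""
    return "".join("N" if base.islower() else base for base in seq)


# Toy sequence, invented for this example: 4 of 12 bases are soft-masked.
seq = "ACGTacgtACGT"
print(soft_masked_fraction(seq))
print(hard_mask(seq))
```

Soft-masking keeps the underlying sequence available to downstream tools, whereas hard-masking discards it, so which form to use depends on what the gene predictor or aligner expects.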

2.2 Align evidence to the genome

Evidence includes ESTs, RNA-seq data, etc.

Aligners: STAR, HISAT2, TopHat, BWA, etc. Cufflinks is not an aligner; it assembles transcripts from the aligned RNA-seq reads.

2.3 Annotate protein-coding genes

2.3.1 Structural annotation

Gene structures can be predicted ab initio or with evidence-driven gene prediction; the gene finder can be trained in a supervised or unsupervised way.

Tools: Augustus, BRAKER1

Assessing annotation quality:

percentage of annotations that encode proteins with known domains
Annotation edit distance (AED) metric (from MAKER2); 0 means the annotation agrees perfectly with the evidence, 1 means no evidence support
in-frame stop codons
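
Two of the checks above are easy to sketch in code. The example below assumes the standard genetic code for stop codons, and uses a simplified nucleotide-level AED, defined as 1 - (sensitivity + specificity) / 2 on half-open (start, end) intervals. MAKER2's own calculation may differ in detail, and all function names here are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def in_frame_stops(cds):
    """0-based positions of stop codons in frame 0, ignoring the final codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [3 * i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


def covered_bases(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    return {pos for start, end in intervals for pos in range(start, end)}


def aed(annotation, evidence):
    """Simplified nucleotide-level annotation edit distance between two interval lists."""
    a = covered_bases(annotation)
    e = covered_bases(evidence)
    if not a or not e:
        return 1.0  # nothing to compare: treat as unsupported
    overlap = len(a & e)
    sensitivity = overlap / len(e)  # fraction of evidence bases that are annotated
    specificity = overlap / len(a)  # fraction of annotated bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


# Toy inputs, invented for this example.
print(in_frame_stops("ATGTAGAAATAA"))
print(aed([(0, 100)], [(50, 150)]))
```

Building sets of positions is fine for a single gene model but wasteful genome-wide; an interval-based overlap calculation would be the obvious replacement at scale.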

2.3.2 Functional annotation

BLAST protein sequences against NCBI or UniProt databases
Protein domains: CDD, InterProScan
GO terms and pathway mapping

2.4 Annotate non-coding RNAs

tRNA: tRNAscan-SE
snoRNAs: Snoscan
MicroRNA: miRDB, miRDeep2
Pipelines: Ensembl

2.5 Annotation visualisation

Genome browsers and annotation editors: Artemis, Apollo, JBrowse, GBrowse

3. Curation

Curator comments guidelines (FlyBase)
