(Title image from yourgenome.org)
Pipeline to assemble and annotate a eukaryotic genome.
Based on "A beginner’s guide to eukaryotic genome annotation" (Yandell and Ence, 2012), which is a very good resource.
1. Genome Assembly
Statistics for judging assembly quality: N50, average gap size per scaffold, and average number of gaps per scaffold.
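As a rough illustration of these statistics, here is a minimal Python sketch. It assumes scaffolds are available as plain strings and that a gap is any run of one or more N characters; the function names `n50` and `gap_stats` are made up for this example, and the gap size is averaged over all gaps in the assembly.

```python
import re
from statistics import mean


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def gap_stats(scaffolds):
    """Return (average gap size, average number of gaps per scaffold).

    Assumption: a gap is a run of one or more N (or n) characters.
    """
    gaps_per_scaffold = [re.findall(r"[Nn]+", seq) for seq in scaffolds]
    gap_sizes = [len(gap) for gaps in gaps_per_scaffold for gap in gaps]
    avg_gap_size = mean(gap_sizes) if gap_sizes else 0.0
    avg_gap_count = (
        mean(len(gaps) for gaps in gaps_per_scaffold) if scaffolds else 0.0
    )
    return avg_gap_size, avg_gap_count


# Toy scaffolds, invented for illustration only.
scaffolds = ["ACGTNNNNACGT", "ACGTACGTAC", "ACNNGTNNNNAC"]
scaffold_n50 = n50([len(s) for s in scaffolds])
avg_gap_size, avg_gap_count = gap_stats(scaffolds)
```

A higher N50 and fewer, smaller gaps generally indicate a more contiguous assembly, which makes gene models less likely to be split across scaffolds.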
2. Genome Annotation
2.1 Identify repeats and mask the genome
Repeats include ‘low-complexity’ sequences and transposable (mobile) elements.
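Repeat-masking tools commonly report repeats either by soft-masking (repeat bases in lowercase) or hard-masking (repeat bases replaced by N). The sketch below assumes a soft-masked sequence held as a Python string and shows how to convert it to a hard-masked one and how to measure how much of it is masked. The function names are hypothetical helpers, not part of any masking tool.

```python
def hard_mask(seq):
    """Convert a soft-masked sequence (repeats in lowercase) into a
    hard-masked one (repeats replaced by N)."""
    return "".join("N" if base.islower() else base for base in seq)


def masked_fraction(seq):
    """Return the fraction of bases that are soft-masked (lowercase)."""
    if not seq:
        return 0.0
    return sum(base.islower() for base in seq) / len(seq)


# Toy sequence, invented for illustration only.
soft_masked = "ACGTacgtacgtACGT"
hard_masked = hard_mask(soft_masked)
fraction = masked_fraction(soft_masked)
```

Soft-masking keeps the underlying sequence available to downstream tools, whereas hard-masking discards it, so the choice depends on what the aligner or gene predictor expects.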
2.2 Align evidence to the genome
Evidence includes ESTs, RNA-seq data, etc.
Aligners: STAR, HISAT2, TopHat, BWA, etc. Transcript assemblers such as Cufflinks can then build transcripts from the aligned reads.
2.3 Annotate protein-coding genes
2.3.1 Structural annotation
Gene predictors can be trained in a supervised or unsupervised manner, and gene models can be predicted ab initio or with evidence-driven prediction.
Tools: Augustus, BRAKER1
Assessing annotation quality:
- percentage of annotations that encode proteins with known domains
- Annotation edit distance (AED) metric (from MAKER2)
- in-frame stop codons
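Two of the checks above can be sketched in a few lines of Python. The AED function follows the usual nucleotide-level definition, AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases overlapped by the annotation and SP is the fraction of annotation bases overlapped by the evidence. It assumes exons are given as 1-based, inclusive (start, end) pairs on the same strand. The stop-codon check assumes the standard genetic code and a CDS string that starts in frame. Function names are illustrative and are not MAKER2's API.

```python
def covered_positions(exons):
    """Expand 1-based, inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def annotation_edit_distance(annotation_exons, evidence_exons):
    """Return the AED between an annotation and its supporting evidence.

    0.0 means perfect agreement; 1.0 means no overlap at all.
    """
    annotation = covered_positions(annotation_exons)
    evidence = covered_positions(evidence_exons)
    if not annotation or not evidence:
        return 1.0
    shared = len(annotation & evidence)
    sensitivity = shared / len(evidence)
    specificity = shared / len(annotation)
    return 1.0 - (sensitivity + specificity) / 2.0


# Assumption: standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_stops(cds):
    """Return 0-based codon indices of stop codons located before the
    final codon of an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


# Toy inputs, invented for illustration only.
aed_identical = annotation_edit_distance([(1, 100)], [(1, 100)])
aed_half = annotation_edit_distance([(1, 100)], [(51, 150)])
premature = in_frame_stops("ATGTAAGGGTGA")
```

Expanding intervals into position sets keeps the sketch short but is memory-hungry for long genes; interval arithmetic would be the better choice at genome scale.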
2.3.2 Functional annotation
- BLAST protein sequences: NCBI, UniProt
- Protein domains: CDD, InterProScan
- GO terms and Pathway mapping
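For the BLAST step, a common follow-up is to keep only the best hit per predicted protein. The sketch below assumes BLAST was run with tabular output (`-outfmt 6`) and its default twelve columns. The e-value cutoff of 1e-5, the function name `best_hits`, and the example rows are assumptions made up for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue=1e-5):
    """Return {query: (subject, bitscore, evalue)} keeping, for each query,
    the highest-bitscore hit whose e-value passes the cutoff."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore, evalue)
    return best


# Invented example rows; gene and protein names are placeholders.
EXAMPLE = (
    "gene1\tprotB\t60.0\t180\t72\t0\t5\t184\t3\t182\t1e-40\t150\n"
    "gene1\tprotA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t390\n"
    "gene2\tprotC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25\n"
)
example_hits = best_hits(io.StringIO(EXAMPLE))
```

Here gene1 is assigned protA, and gene2 is left without a hit because its only match fails the e-value cutoff. The resulting subject identifiers can then be used to look up descriptions, domains, or GO terms.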
2.4 Annotate non-coding RNAs
- tRNA: tRNAscan-SE
- small nucleolar RNAs (snoRNAs): Snoscan
- MicroRNA: miRDB, miRDeep2
- Pipelines: Ensembl
2.5 Annotation visualisation
- Genome browser: Artemis, Apollo, JBrowse, GBrowse