Notes on Methods for Computational Gene Prediction

multi-allelic loci -> quantitative genetics & SNPs

Intron: typically (but not always) begins with a donor site GT and ends with an acceptor site AG. Stop codons: TGA, TAA, TAG

An intronless gene may be a retrotransposed pseudogene: a transcript that was reverse-transcribed and inserted into the chromosome at a random location; another type: ribozymes

pre-mRNA -> remove introns (splicing) -> add 5' CAP and 3' polyA -> mature mRNA -> export

5' CAP (a modified G, 7-methylguanosine) protects mRNA from degradation by RNases

3' polyA usually 20-40nt; polyA signal: ATTAAA / AATAAA

functional annotation

gene boundaries are delimited
exon-intron structure identified
biological functions of product are stated

ab initio predictions based just on sequence patterns

should also consider:

patterns of evolutionary conservation between the target and other related organisms
genomic repeats (tend to lie in intergenic regions)

Pseudogenes

genomic fossils of genes which were once functional but are now defunct
tend to have codon stats that resemble those of real genes
often contain in-frame stops, or invalid or rare splice sites
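
One of the pseudogene hints above, in-frame stops, is easy to check mechanically. A minimal sketch, assuming the candidate CDS is already in frame (starts at codon position 0) and uses the standard stop codons; the function name inframe_stops is made up for illustration:

```python
STOPS = {"TAG", "TAA", "TGA"}

def inframe_stops(cds):
    """Return 0-based codon indices of stop codons that occur
    before the final codon of an in-frame coding sequence."""
    cds = cds.upper()
    # Split into complete codons; a trailing partial codon is ignored.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # The last codon is expected to be the real stop, so skip it.
    return [k for k, codon in enumerate(codons[:-1]) if codon in STOPS]
```

A non-empty result does not prove a pseudogene (sequencing errors or selenocysteine TGA codons give the same signal); it only flags the candidate for a closer look.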

common assumptions

no overlapping genes
no nested genes
no partial genes
no noncanonical signal consensus
no frameshifts or sequencing errors
optimal parse only
constraints on feature lengths (not necessary for GHMM-based gene finders)
no split start codons
no split stop codons (CDS with inframe stops do occur)
no alternative splicing (?)
no selenocysteine codons (TGA)
no ambiguity codes (e.g., R, Y, N)
one haplotype only (SNP density in human: 1 per 1 kb)

how does it work?

get all signals: return a list of all potential donors / acceptors (GT/AG) and start / stop codons (ATG / TAG TAA TGA)
construct all forward-strand ORFs: all intervals in the sequence beginning with a start codon or acceptor (ATG or AG) and ending with a donor or stop codon (GT, TAG, TAA, TGA)
find open reading frames
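
The first and last steps above can be sketched in a few lines. This is a minimal illustration, not a real gene finder: it scans the forward strand only, and the ORF step covers just the simplest case (ATG to the first in-frame stop), not intervals bounded by splice sites. The function names find_signals and find_orfs are hypothetical.

```python
import re

STOPS = {"TAG", "TAA", "TGA"}

def find_signals(seq):
    """Return sorted (position, kind) pairs for every potential signal."""
    patterns = {"donor": "GT", "acceptor": "AG",
                "start": "ATG", "stop": "TAG|TAA|TGA"}
    hits = []
    for kind, pat in patterns.items():
        # A lookahead reports overlapping occurrences as well.
        for m in re.finditer(f"(?=({pat}))", seq):
            hits.append((m.start(), kind))
    return sorted(hits)

def find_orfs(seq):
    """Return half-open (start, end) intervals that begin with ATG
    and end with the first in-frame stop codon (stop included)."""
    orfs = []
    for m in re.finditer("(?=ATG)", seq):
        start = m.start()
        # Walk codon by codon in the frame set by this ATG.
        for i in range(start + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                orfs.append((start, i + 3))
                break
    return orfs
```

For example, find_orfs("ATGAAATAG") gives [(0, 9)]. Nearly all GT/AG hits in a real sequence are false positives, which is why the signals are scored afterwards instead of being trusted as found.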

GC bias in CDS: on average, %GC in CDS > %GC in the whole sequence; candidates are scored using codon bias and WMM (weight matrix) scores
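
The GC comparison is plain arithmetic. A minimal sketch, assuming CDS regions are given as half-open (start, end) intervals on the sequence; gc_fraction and gc_excess are illustrative names:

```python
def gc_fraction(seq):
    """Fraction of G or C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_excess(seq, cds_intervals):
    """GC fraction of the concatenated CDS intervals minus the
    GC fraction of the whole sequence."""
    cds = "".join(seq[start:end] for start, end in cds_intervals)
    return gc_fraction(cds) - gc_fraction(seq)
```

A positive gc_excess is what the note above expects on average; a single gene can easily deviate.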

Automated Models

Mathematical models: Hidden Markov Models
Machine learning: Bayesian Networks and Decision Trees
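
To make the HMM idea concrete, here is a toy Viterbi decoder that labels each base as noncoding (N) or coding (C). It is only a sketch: real gene finders use many more states (exons by phase, introns, signals), and every probability in the usage note is invented for illustration. The function assumes a non-empty observation string and strictly positive probabilities.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs (log-space Viterbi)."""
    # Scores for the first symbol.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
          for s in states}]
    back = []
    for x in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            # Best predecessor state for landing in s at this position.
            prev = max(states,
                       key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (V[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][x]))
            ptr[s] = prev
        V.append(scores)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Usage with made-up parameters (GC-rich emissions for the coding state, sticky transitions): states = ("N", "C"), start_p = {"N": 0.5, "C": 0.5}, trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}, emit_p = {"N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}, "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}. The returned path is the "optimal parse" that the common assumptions above refer to.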

Annotation pipelines also consider other evidence: existing annotations (known genes), EST assemblies, RepeatMasker predictions, synteny information.

Cluster analysis

gene set enrichment analysis (GSEA) algorithms

connectivity-based (hierarchical clustering)
centroid-based (K-means)
distribution-based
density-based
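
As an illustration of the centroid-based family, a bare-bones K-means in pure Python. Assumptions to note: points are equal-length numeric tuples (e.g. expression profiles), distance is squared Euclidean, and the initial centroids are simply the first k points, which keeps the sketch deterministic but is not how production code initializes.

```python
def kmeans(points, k, iters=100):
    """Return (labels, centroids) after Lloyd-style K-means iterations."""
    # Assumption: naive deterministic init from the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new.append(tuple(sum(col) / len(members)
                                 for col in zip(*members)))
            else:
                new.append(centroids[c])  # keep an empty cluster in place
        if new == centroids:
            break  # converged
        centroids = new
    return labels, centroids
```

With points [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)] and k = 2 the labels come out as [0, 1, 0, 1].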

Additional features

promoter elements: TATA box (TATA -> CAP -> 5' UTR) and CCAAT box
the branch point (upstream of acceptor site)
CpG islands (preferentially upstream of human genes, esp. housekeeping genes)
signal peptide (aa seq near the N-terminus of a newly translated protein)
polyA signals (3' UTR -> AATAAA)
5' and 3' UTR
CAP site
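
Of the features above, CpG islands are the simplest to score directly. A minimal sketch using the commonly cited thresholds (length >= 200 bp, GC fraction >= 0.5, observed/expected CpG >= 0.6); these cutoffs are not stated in these notes and are assumptions here, as are the function names.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for a window."""
    w = window.upper()
    n = len(w)
    c, g = w.count("C"), w.count("G")
    cpg = w.count("CG")  # CG cannot overlap itself, so count() is exact
    gc = (c + g) / n if n else 0.0
    # Expected CpG count under independence is c * g / n.
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc, obs_exp

def looks_like_cpg_island(window, min_len=200, min_gc=0.5, min_oe=0.6):
    """Apply the assumed length / GC / obs-exp thresholds to one window."""
    gc, obs_exp = cpg_stats(window)
    return len(window) >= min_len and gc >= min_gc and obs_exp >= min_oe
```

In practice the test is applied to sliding windows along the upstream region and overlapping hits are merged.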

Masking, common practice: soft-mask low-complexity repeats (they can occur in CDS) and hard-mask all other repeats
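
The difference between the two kinds of masking, sketched on a plain string. By the usual convention soft-masking lowercases the repeat and hard-masking replaces it with N; the interval format (half-open, 0-based) and function names are assumptions for illustration.

```python
def _apply(seq, intervals, fn):
    """Rewrite each half-open (start, end) interval of seq with fn."""
    chars = list(seq)
    for start, end in intervals:
        chars[start:end] = fn(seq[start:end])
    return "".join(chars)

def soft_mask(seq, intervals):
    """Lowercase the masked regions; the bases stay readable."""
    return _apply(seq, intervals, str.lower)

def hard_mask(seq, intervals):
    """Replace the masked regions with N; the bases are lost."""
    return _apply(seq, intervals, lambda s: "N" * len(s))
```

soft_mask("ACGTACGT", [(2, 5)]) gives "ACgtaCGT" and hard_mask gives "ACNNNCGT". Soft-masking keeps the bases available to a gene finder that chooses to look through the mask, which matters when a low-complexity region sits inside a CDS.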

Alternative splicing

exon skipping
cassette exons
intron retention
alternative polyadenylation
alternative promoter recognition

ncRNAs

prediction methods use as evidence: secondary structure and conservation patterns across related species

Methods to predict secondary structure:

based on thermodynamics: finding the secondary structure that minimizes the Gibbs free energy of the RNA molecule
through the use of stochastic context-free grammars (SCFGs)
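
A full free-energy model needs measured energy parameters, so the sketch below swaps in a much simpler stand-in: Nussinov-style dynamic programming that maximizes the number of base pairs. It shares the recursion shape of thermodynamic folding but is not an energy minimizer. Assumptions: RNA alphabet (A, C, G, U), G-U wobble pairs allowed, no pseudoknots, and at least min_loop unpaired bases inside any hairpin.

```python
def nussinov(seq, min_loop=3):
    """Return the maximum number of nested base pairs in seq."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: base j stays unpaired
            # case: base j pairs with some k, leaving >= min_loop between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

For the hairpin "GGGAAACCC" this returns 3 (three G-C pairs closing an AAA loop).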

Promoters

largely within 40bp of TSS
most popular method: weight matrix (WMM)
typical: McPromoter systems
UPR (upstream promoter region) -> TATA -> Spacer -> Inr (initiator) -> DPE (downstream promoter elements)
not all core promoters have TATA/Inr
another: CpG islands are strongly associated with promoters
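
A weight matrix can be built from a handful of aligned example sites. The sketch below uses log-odds scores against a uniform background with a pseudocount of 1; both choices, the toy TATA-like training sites in the usage note, and the function names are assumptions for illustration and have nothing to do with how McPromoter is implemented.

```python
import math

def build_wmm(sites, background=None, pseudocount=1.0):
    """Build a log-odds weight matrix (one dict per column) from
    equal-length aligned sites over the alphabet ACGT."""
    background = background or {b: 0.25 for b in "ACGT"}
    wmm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        wmm.append({
            b: math.log(((column.count(b) + pseudocount) / total)
                        / background[b])
            for b in "ACGT"
        })
    return wmm

def score(wmm, window):
    """Sum of per-position log-odds for a window of the matrix length."""
    return sum(col[base] for col, base in zip(wmm, window))

def best_hit(wmm, seq):
    """Return the start index of the highest-scoring window in seq."""
    width = len(wmm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(wmm, seq[i:i + width]))
```

Usage: wmm = build_wmm(["TATAAA", "TATAAA", "TATATA", "TATAAA"]); best_hit(wmm, "GGCCTATAAAGGCC") returns 4, the start of the TATAAA match. A WMM treats positions as independent, which is its main limitation as a promoter model.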

differential expression

different activators and signaling pathways depending on the stimulus

a given gene may have multiple enhancers, each active at a different time or in a different cell type

each enhancer is associated with one gene