multi-allelic loci -> quantitative genetics & SNPs
Intron: typically (but not always) begins with a donor site GT and ends with an acceptor site AG
Stop codons: TGA, TAA, TAG
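A minimal sketch of the two checks above, assuming DNA given as a plain string; the function names are hypothetical and only the canonical GT...AG rule is tested, so noncanonical introns are reported as False.

```python
STOP_CODONS = {"TGA", "TAA", "TAG"}

def has_canonical_splice_sites(intron):
    """True if the intron starts with donor GT and ends with acceptor AG."""
    intron = intron.upper()
    # Require at least 4 nt so the GT and AG do not overlap.
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

def is_stop_codon(codon):
    """True for the three stop codons TGA, TAA, TAG."""
    return codon.upper() in STOP_CODONS
```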
Intronless genes are sometimes retrotransposed pseudogenes: a processed mRNA is reverse-transcribed and inserted into the chromosome at a random location; another type: ribozymes
pre-mRNA -> remove intron (splicing) -> add 5' CAP and 3' polyA -> mature RNA -> export
5' cap (a modified G): protects the mRNA from degradation by RNases
3' polyA tail: roughly 200 nt in mammals; polyA signal: ATTAAA / AATAAA, usually 10-30 nt upstream of the cleavage site
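The two polyA signal hexamers can be located with a simple scan. This is only a sketch: `find_polya_signals` is a hypothetical helper, and a plain string match says nothing about whether a hit is actually used as a signal.

```python
import re

def find_polya_signals(seq):
    """Return sorted (position, hexamer) pairs for AATAAA / ATTAAA, 0-based."""
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    # A lookahead reports every occurrence, including overlapping ones.
    pattern = re.compile(r"(?=(AATAAA|ATTAAA))")
    return sorted((m.start(), m.group(1)) for m in pattern.finditer(seq))
```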
functional annotation
- gene boundaries are delimited
- exon-intron structure identified
- biological functions of product are stated
ab initio predictions based just on sequence patterns
should also consider:
- patterns of evolutionary conservation between the target and other related organisms
- genomic repeats (tend to lie in intergenic regions)
Pseudogenes
- genomic fossils of genes which were once functional but are now defunct
- tend to have codon stats that resemble those of real genes
- they often contain in-frame stops, or invalid or rare splice sites
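One of the pseudogene hints above, in-frame stops, is easy to test for. A minimal sketch, assuming the CDS string starts in frame 0 and that a stop in the final codon is the legitimate one; `inframe_stops` is a hypothetical name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def inframe_stops(cds):
    """0-based positions of premature stop codons in frame 0.

    The last complete codon is skipped because a real CDS is expected
    to end with a stop.
    """
    cds = cds.upper()
    last_codon_start = len(cds) - len(cds) % 3 - 3
    return [i for i in range(0, last_codon_start, 3)
            if cds[i:i + 3] in STOP_CODONS]
```

An empty list is consistent with an intact reading frame; any position returned is a candidate sign of a pseudogene (or of a sequencing error).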
common assumptions
- no overlapping genes
- no nested genes
- no partial genes
- no noncanonical signal consensus
- no frameshifts or sequencing errors
- optimal parse only
- constraints on feature lengths (not necessary for GHMM-based gene finders)
- no split start codons
- no split stop codons (CDS with inframe stops do occur)
- no alternative splicing (?)
- no selenocysteine codons (TGA)
- no ambiguity codes (e.g., R, Y, N)
- one haplotype only (SNP density in human: 1 per 1 kb)
how to work?
- get all signals: return a list of all potential donors / acceptors (GT / AG) and start / stop codons (ATG / TAG, TAA, TGA)
- construct all forward-strand ORFs: all intervals in the sequence that begin with a start codon or acceptor (ATG or AG) and end with a donor or stop codon (GT, TAG, TAA, TGA)
- find open reading frames
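The steps above can be sketched as follows. This is a simplification under stated assumptions: only exact consensus matches on the forward strand are reported, and `forward_orfs` covers just the ATG-to-stop case (single-exon ORFs), not intervals bounded by splice sites. All names are hypothetical.

```python
import re

SIGNALS = {
    "donor": ["GT"],
    "acceptor": ["AG"],
    "start": ["ATG"],
    "stop": ["TAG", "TAA", "TGA"],
}

def find_signals(seq):
    """Map each signal type to the sorted 0-based positions where it occurs."""
    seq = seq.upper()
    hits = {}
    for name, motifs in SIGNALS.items():
        positions = []
        for motif in motifs:
            # Lookahead so that overlapping occurrences are all reported.
            positions += [m.start() for m in re.finditer(f"(?={motif})", seq)]
        hits[name] = sorted(positions)
    return hits

def forward_orfs(seq):
    """(start, end) intervals, end exclusive, from each ATG to its first in-frame stop."""
    seq = seq.upper()
    stops = set(SIGNALS["stop"])
    orfs = []
    for start in find_signals(seq)["start"]:
        for i in range(start + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in stops:
                orfs.append((start, i + 3))
                break
    return orfs
```

For the toy sequence `CCATGAAATGATAGGT` (invented for illustration), the only ORF is the interval (2, 11), i.e. `ATGAAATGA`; the second ATG has no in-frame stop before the sequence ends.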
GC bias in CDS: on average, %GC in CDS > %GC in the whole sequence; scored using codon bias and WMM (weight matrix) scores
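The GC comparison itself is simple arithmetic. A minimal sketch, assuming CDS coordinates are given as 0-based, end-exclusive intervals; both function names are hypothetical and ambiguity codes are ignored when counting.

```python
def gc_fraction(seq):
    """Fraction of G+C among unambiguous bases (A, C, G, T); 0.0 if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def gc_excess(genome, cds_intervals):
    """GC fraction of the concatenated CDS intervals minus that of the whole sequence."""
    cds = "".join(genome[start:end] for start, end in cds_intervals)
    return gc_fraction(cds) - gc_fraction(genome)
```

A positive `gc_excess` is what the GC bias described above would predict for real coding regions.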
Automated Models
- Mathematical models: Hidden Markov Models
- Machine learning: Bayesian Networks and Decision Trees
First-pass annotation pipelines also consider other evidence: existing annotations (known genes), EST assemblies, RepeatMasker predictions, synteny information.
Cluster analysis
gene set enrichment analysis (GSEA) algorithms
- connectivity-based (hierarchical clustering)
- centroid-based (K-means)
- distribution-based
- density-based
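As an illustration of the centroid-based option in the list above, here is a minimal K-means sketch (Lloyd's algorithm) in plain Python. Seeding the centroids with the first k points is an assumption made to keep the example deterministic; practical implementations usually use random or k-means++ initialisation. The function name and the toy expression profiles in the usage note are invented.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on tuples of floats; returns (labels, centroids)."""
    # Assumption: seed centroids with the first k points (deterministic).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

With four toy profiles `[(1.0, 1.1), (5.0, 5.2), (0.9, 1.0), (5.1, 4.9)]` and `k=2`, the first and third points end up in one cluster and the second and fourth in the other.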
Additional features
- promoter elements: TATA box (TATA -> CAP -> 5' UTR) and CCAAT box
- the branch point (upstream of the acceptor site)
- CpG islands (preferentially upstream of human genes, esp. housekeeping genes)
- signal peptide (aa seq near the N-terminus of a newly translated protein)
- polyA signals (3' UTR -> AATAAA)
- 5' and 3' UTR
- CAP site
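Of the features above, CpG islands are the easiest to score directly from sequence. A minimal sketch: the observed/expected CpG ratio is (#CpG x length) / (#C x #G), and the default thresholds (length >= 200 bp, GC > 50%, ratio > 0.6) are the commonly cited Gardiner-Garden and Frommer criteria, which are not stated in these notes. Function names are hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    return seq.count("CG") * len(seq) / (c * g) if c and g else 0.0

def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_oe=0.6):
    """Apply the length, GC-content and obs/exp thresholds to one window."""
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return gc > min_gc and cpg_obs_exp(seq) > min_oe
```

In practice the test would be applied to sliding windows along the region upstream of a candidate gene rather than to a single string.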
Masking, common practice: soft-mask low-complexity repeats (these can occur in CDS) and hard-mask all others
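The difference between the two masking styles, sketched under the usual convention that soft-masking lowercases bases and hard-masking replaces them with N. `apply_mask` is a hypothetical helper; intervals are assumed 0-based and end-exclusive, as a repeat finder's output might be converted to.

```python
def apply_mask(seq, intervals, hard=False):
    """Soft-mask (lowercase) or hard-mask (N) the given intervals."""
    chars = list(seq.upper())
    for start, end in intervals:
        # Clamp to the sequence so out-of-range intervals are harmless.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)
```

Soft-masked sequence keeps its bases, so a gene finder can still read through a low-complexity stretch inside a CDS; hard-masked sequence is lost to downstream tools.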
Alternative splicing
- exon skipping
- cassette exons
- intron retention
- alternative polyadenylation
- alternative promoter recognition
ncRNAs
prediction methods use as evidence: secondary structure and conservation patterns across related species
Methods to predict secondary structure:
- based on thermodynamics: finding the secondary structure that minimizes the Gibbs free energy of the RNA molecule
- through the use of stochastic context-free grammars (SCFGs)
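Real thermodynamic folding uses measured energy parameters, which are beyond a short example. As a stand-in, the sketch below uses the Nussinov dynamic program, which maximizes the number of nested base pairs instead of minimizing free energy; it shares the same recursive structure but is not an energy model. The minimum hairpin loop of 3 unpaired bases and the function name are assumptions.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov DP) for an RNA string."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j] inclusive.
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: base j is unpaired
            # case: base j pairs with k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

For the toy hairpin `GGGAAACCC` the three G-C pairs closing the AAA loop give a score of 3.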
Promoters
- largely within 40 bp of the TSS (transcription start site)
- most popular method: weight matrices (WMM)
- typical example: the McPromoter system
- UPR (upstream promoter region) -> TATA -> Spacer -> Inr (initiator) -> DPE (downstream promoter elements)
- not all core promoters have TATA/Inr
- another: CpG islands are strongly associated with promoters
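A weight matrix (WMM) scorer of the kind mentioned above can be sketched in a few lines. Assumptions: the training sites are pre-aligned and of equal length, the background is uniform (0.25 per base), a pseudocount of 1 is added per base, and sequences contain only A, C, G, T. The toy TATA-like sites are invented, not real training data, and this is not how McPromoter itself is implemented.

```python
import math

def build_wmm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds weight matrix: one {base: score} dict per aligned position."""
    total = len(sites) + 4 * pseudocount
    wmm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        wmm.append({base: math.log2((column.count(base) + pseudocount) / total / background)
                    for base in "ACGT"})
    return wmm

def score(wmm, window):
    """Sum of per-position log-odds scores for a window of matrix width."""
    return sum(col[base] for col, base in zip(wmm, window))

def best_hit(wmm, seq):
    """0-based start of the highest-scoring window in seq."""
    width = len(wmm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(wmm, seq[i:i + width]))

# Toy aligned sites (invented for illustration).
toy_wmm = build_wmm(["TATAAA", "TATAAA", "TATATA", "TATAAA"])
```

Scanning `GCGCTATAAAGGC` with `best_hit(toy_wmm, ...)` picks the window starting at position 4, the embedded `TATAAA`.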
differential expression
different stimuli activate different signaling pathways and activators
a given gene may have multiple enhancers, each active at a different time or in a different cell type
each enhancer is associated with one gene