Notes on Methods for Computational Gene Prediction

Some self-notes from the book Methods for Computational Gene Prediction by William H. Majoros.

multi-allelic loci -> quantitative genetics & SNPs

Intron: typically (but not always) begins with the donor site GT and ends with the acceptor site AG. Stop codons are always TGA, TAA, or TAG.

Sometimes an intronless gene may be a retrotransposed pseudogene: a processed mRNA that was reverse-transcribed and inserted into the chromosome at a random location. Another type: ribozymes.

pre-mRNA -> remove intron (splicing) -> add 5’ CAP and 3’ polyA -> mature RNA -> export

5’ CAP (a modified G, 7-methylguanosine): protects mRNA from being degraded by RNases

3’ polyA usually 20-40nt; polyA signal: ATTAAA / AATAAA

functional annotation

  • gene boundaries are delimited
  • exon-intron structure identified
  • biological functions of product are stated

ab initio predictions based just on sequence patterns

should also consider:

  • patterns of evolutionary conservation btw the target and other related organisms
  • genomic repeats (tends to be intergenic regions)

Pseudogenes

  • genomic fossils of genes which were once functional but are now defunct
  • tend to have codon stats that resemble those of real genes
  • they often contain in-frame stops, or invalid or rare splice sites
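
The in-frame stop check mentioned above can be sketched as follows. This is a minimal illustration, not the book's method; the function name `inframe_stops` is hypothetical, and it assumes the input is a candidate CDS already in frame 0 whose last codon is the legitimate stop.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}

def inframe_stops(cds):
    """Return 0-based positions of premature in-frame stop codons.

    Assumes `cds` starts in frame 0; the final complete codon is treated
    as the legitimate stop and is not reported.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3      # ignore a trailing partial codon
    return [i for i in range(0, usable - 3, 3) if cds[i:i + 3] in STOP_CODONS]

# A candidate with a premature TAA in its second codon:
print(inframe_stops("ATGTAAGGGTGA"))
```

A non-empty result is only a hint that the candidate may be a pseudogene; sequencing errors can produce the same pattern.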

common assumptions

  • no overlapping genes
  • no nested genes
  • no partial genes
  • no noncanonical signal consensus
  • no frameshifts or sequencing errors
  • optimal parse only
  • constraints on feature lengths (not necessary for GHMM-based gene finders)
  • no split start codons
  • no split stop codons (CDS with inframe stops do occur)
  • no alternative splicing (?)
  • no selenocysteine codons (TGA)
  • no ambiguity codes (e.g., R, Y, N)
  • one haplotype only (SNP density in human: 1 per 1 kb)

how to work?

  • get all signals: return a list of all potential donors / acceptors (GT/AG), start / stop codons (ATG / TAG, TAA, TGA)
  • construct all forward-strand ORFs: all intervals in the sequence beginning with a start codon or acceptor (ATG or AG) and ending with a donor or stop codon (GT, TAG, TAA, TGA)
  • find open reading frames
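
A minimal sketch of the first and last steps, assuming a plain forward-strand DNA string with no ambiguity codes. The function names `find_signals` and `find_orfs` are made up for this note, and the ORF definition used here (ATG to the first in-frame stop) is the simplest one, not necessarily the one the book uses.

```python
import re

STOP_CODONS = {"TAG", "TAA", "TGA"}

def find_signals(seq):
    """Return {signal name: [0-based positions]} for every potential signal."""
    seq = seq.upper()
    patterns = {
        "donor": "GT",
        "acceptor": "AG",
        "start": "ATG",
        "stop": "TAG|TAA|TGA",
    }
    # A zero-width lookahead reports every position, including overlapping hits.
    return {
        name: [m.start() for m in re.finditer(f"(?=(?:{pat}))", seq)]
        for name, pat in patterns.items()
    }

def find_orfs(seq):
    """Return half-open (start, end) intervals of forward-strand ORFs.

    Each ORF runs from an ATG to the first in-frame stop codon (stop included).
    ATGs without an in-frame stop are skipped.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3))
                break
    return orfs

example = "CCATGAAATAGGTAG"
print(find_signals(example))
print(find_orfs(example))
```

Most hits are false positives, since GT and AG occur everywhere by chance; a real gene finder scores each candidate signal rather than accepting all of them.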

GC bias in CDS: on average %GC in CDS > %GC in whole sequence, based on codon bias and WMM (weight matrix) score
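
To check the GC bias on real data, one could compare the GC fraction of annotated CDS intervals against the whole sequence. A small sketch; the helper name `gc_fraction` and the example interval are assumptions for illustration only.

```python
def gc_fraction(seq):
    """Fraction of G+C among unambiguous bases (A, C, G, T); 0.0 if none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

genome = "ATATATATGCGCGCCTGAATATAT"
cds = genome[6:18]            # hypothetical CDS interval
print(gc_fraction(cds), gc_fraction(genome))
```

Ambiguity codes such as N are left out of the denominator so that masked regions do not drag the estimate down.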

Automated Models

  • Mathematical models: Hidden Markov Models
  • Machine learning: Bayesian Networks and Decision Trees

Annotation pipelines start from gene predictions but also consider other evidence: existing annotations (known genes), EST assemblies, RepeatMasker predictions, synteny info.

Cluster analysis

gene set enrichment analysis (GSEA) algorithms

  • connectivity-based (hierarchical clustering)
  • centroid-based (K-means)
  • distribution-based
  • density-based

Additional features

  • promoters TATA box (TATA->CAP->5’UTR) and CCAAT box
  • the branch point (upstream of acceptor site)
  • CpG island: (preferentially upstream of human genes, esp. housekeeping genes)
  • signal peptide (aa seq near N-terminal of a newly translated protein)
  • polyA signals (3’UTR -> AATAAA)
  • 5’ and 3’ UTR
  • CAP site

Masking common practice: soft-mask low complexity repeats (can occur in CDS) and hard-mask all others
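
The difference between the two kinds of masking, as a sketch: soft-masking lowercases the repeat so a gene finder can still read through it, while hard-masking replaces it with N. The function name `apply_mask` and the half-open interval convention are assumptions here; in practice the intervals would come from a repeat-finding tool.

```python
def apply_mask(seq, intervals, hard=False):
    """Mask half-open (start, end) intervals: lowercase (soft) or N (hard)."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)

print(apply_mask("ACGTACGT", [(2, 5)]))
print(apply_mask("ACGTACGT", [(2, 5)], hard=True))
```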

Alternative splicing

  • exon skipping
  • cassette exons
  • intron retention
  • alternative polyadenylation
  • alternative promoter recognition

ncRNAs

pred methods as evidence: secondary structure and conservation patterns across related species

Methods to predict secondary structure:

  • based on thermodynamics: finding the secondary structure that minimizes the Gibbs free energy of the RNA molecule
  • through the use of stochastic context-free grammars (SCFGs)
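
A full free-energy model is too long for a note, so the sketch below swaps in the simpler Nussinov dynamic program, which maximizes the number of nested base pairs instead of minimizing free energy. It has the same recursive shape as the thermodynamic methods but is not one of them. The function name, the `min_loop` parameter, and the decision to allow G-U wobble pairs are assumptions of this sketch.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA string (Nussinov DP).

    `min_loop` is the minimum number of unpaired bases enclosed by a pair.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j]: max pairs within seq[i..j]; entries with j <= i stay 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                  # case 1: j is unpaired
            for k in range(i, j - min_loop):        # case 2: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(nussinov("GGGAAAUCC"))
```

The table fill is O(n^3) in time, which is fine for short ncRNA candidates but slow for long transcripts.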

Promoters

  • largely within 40bp of TSS
  • most popular methods: weight matrix (WMM)
  • typical: McPromoter systems
  • UPR (upstream promoter region) -> TATA -> Spacer -> Inr (initiator) -> DPE (downstream promoter elements)
  • not all core promoters have TATA/Inr
  • another: CpG island strongly associated with promoters
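
A weight matrix can be sketched as a per-position log-odds table built from aligned example sites and slid along the sequence. The training sites, pseudocount, and uniform background below are made up for illustration; this is not the McPromoter model, and the function names are hypothetical. Input is assumed to contain only A, C, G, T.

```python
import math

def build_wmm(sites, pseudocount=1.0, background=0.25):
    """Log-odds (base 2) weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    wmm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        wmm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return wmm

def score_window(wmm, window):
    """Sum the per-position log-odds for one window of the matrix width."""
    return sum(col[base] for col, base in zip(wmm, window))

def best_hit(wmm, seq):
    """Return (score, 0-based position) of the highest-scoring window."""
    width = len(wmm)
    return max(
        (score_window(wmm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )

tata_sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]   # toy training set
wmm = build_wmm(tata_sites)
print(best_hit(wmm, "GGCGTATAAAGGC"))
```

The pseudocount keeps unseen bases from producing a log of zero; its value noticeably affects scores when the training set is this small.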

differential expression

different stimuli activate different signaling pathways and activators

a given gene may have multiple enhancers, each active at a different time or in a different cell type

each enhancer is associated with one gene

Z. Lu
Computer biologist, amateur photographer, vintage fan and web lover.