Notes on RNAseq Data Analysis - a Practical Approach

Fragmented notes from the book RNAseq Data Analysis - a Practical Approach

Structure

Primary

  • promoters
  • intron-exon junctions
  • 5’ and 3’ UTRs
  • polyA site

Secondary

Tertiary

RNAseq can provide

  • 5’ TSS
  • 5’ UTR
  • exon-intron boundaries
  • 3’ UTR
  • polyA site
  • alternative usage of any of above

Fusion genes arise from cytogenetic derangements: genomic amplification, translocations, deletions

long non-coding RNAs (lncRNAs)

  • longer than 200 nt
  • do not overlap protein-coding exons
  • can control transcription as enhancers (eRNA), competitors (ceRNA) or as noise

small non-coding RNAs

  • miRNA (21-23 nt)
  • piRNA
  • endo-siRNA
  • snoRNA
  • snRNA
  • tRNA (73-93 nt)
  • moRNA
  • eRNA

Reads coverage:

  • Illumina for coverage
  • SOLiD for accuracy
  • Roche 454 / PacBio for length

Pre-processing: remove low-quality bases and artifacts, incl. adapters and library-construction sequences

QC

  • base quality: filter low-quality bases (Trimmomatic / FASTX-Toolkit / PRINSEQ)
  • ambiguous bases
  • adapters (TagCleaner / CutAdapt)
  • read length
  • sequence-specific bias
  • GC-content
  • duplicates (removal is not recommended for DGE)
  • sequence contamination
  • low-complexity sequences / polyA tails
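
The base-quality, ambiguous-base and GC-content checks above can be sketched in a few lines. This is a minimal illustration, not how Trimmomatic or PRINSEQ work internally: it assumes Phred+33 encoded FASTQ quality strings, and the function names and the `min_q=20` threshold are hypothetical choices made for the example.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq: str, quality: str, min_q: int = 20) -> tuple[str, str]:
    """Drop bases from the 3' end until the last base has quality >= min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous (A/C/G/T) bases; 0.0 if there are none."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def ambiguous_fraction(seq: str) -> float:
    """Fraction of bases that are not A/C/G/T (for example N)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return sum(1 for base in seq if base not in "ACGT") / len(seq)


if __name__ == "__main__":
    read, qual = "ACGTNACGTT", "IIIIIIII##"  # '#' decodes to Q2, 'I' to Q40
    trimmed, trimmed_qual = trim_3prime(read, qual)
    print(trimmed, gc_content(trimmed), ambiguous_fraction(trimmed))
```

Real trimmers usually apply a sliding-window or running-sum criterion rather than this simple trailing trim, so a single good base near the 3' end does not protect a poor tail.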

mapping stats: SAMtools or RSeQC

de novo assembly

different from genome assembly; assemblers build on de Bruijn graphs

  • mapping-based assembly: Cufflinks and Scripture
  • de novo assembly: Velvet + Oases; Trinity (Inchworm-Chrysalis-Butterfly)
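
To make the de Bruijn graph idea concrete, here is a minimal sketch: every k-mer in a read becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and a contig is read off by following unambiguous edges. The function names are hypothetical, and real assemblers such as Velvet or Trinity add error correction, coverage tracking and handling of both strands, none of which is shown here.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unambiguous(graph, start):
    """Extend a contig from start while the current node has one distinct successor."""
    contig = start
    node = start
    seen = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch point (e.g. alternative isoforms)
        node = successors.pop()
        if node in seen:
            break  # stop on cycles
        seen.add(node)
        contig += node[-1]
    return contig


if __name__ == "__main__":
    reads = ["ATGGCG", "GGCGTA"]  # two overlapping toy reads
    graph = de_bruijn(reads, k=4)
    print(walk_unambiguous(graph, "ATG"))
```

In transcriptome assembly the branch points are informative rather than a nuisance, since alternative isoforms of one gene share k-mers and then diverge, which is one reason the problem differs from genome assembly.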

Read mapping

  • reads per gene: htseq-count / Qualimap / BEDTools / Cufflinks; the tools differ in how they handle multimapping reads
  • reads per transcript: Expectation Maximization (EM) Approach: Cufflinks / eXpress
  • reads per exon: DEXSeq
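
The expectation-maximisation idea behind transcript-level counting can be sketched as follows. This is a toy model, not the Cufflinks or eXpress implementation: each read is reduced to the set of transcripts it is compatible with, `lengths` holds assumed effective transcript lengths, and the function name and iteration count are hypothetical.

```python
def em_abundance(compat, lengths, n_iter=200):
    """Estimate the fraction of reads originating from each transcript.

    compat  : list of tuples, the transcript ids each read is compatible with
    lengths : dict mapping transcript id -> effective length
    """
    transcripts = list(lengths)
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: split each read among its compatible transcripts,
        # weighting by current abundance per unit of transcript length.
        for hits in compat:
            weights = {t: theta[t] / lengths[t] for t in hits}
            total = sum(weights.values())
            if total == 0:
                continue
            for t, w in weights.items():
                expected[t] += w / total
        # M-step: new abundances are the normalised expected read counts.
        n_assigned = sum(expected.values())
        theta = {t: expected[t] / n_assigned for t in transcripts}
    return theta


if __name__ == "__main__":
    # 30 reads unique to A, 10 unique to B, 20 compatible with both
    reads = [("A",)] * 30 + [("B",)] * 10 + [("A", "B")] * 20
    print(em_abundance(reads, {"A": 1000, "B": 1000}))
```

In this toy case the ambiguous reads end up shared 3:1 in favour of A, mirroring the ratio of the unique reads, whereas a gene-level counter would either discard them or count them once for the gene as a whole.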

DEXSeq

  • input from GTF + bam/sam
  • counts per exon (from script) -> table in R
  • normalise by estimating size factor
  • estimate exon-specific dispersion values
  • testing for diff exon usage
  • can be used for alternative splicing prediction (data held in an ExonCountSet)
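
The size-factor step can be illustrated with the median-of-ratios approach used by DESeq-style tools. This is a plain-Python sketch under the assumption that counts arrive as a list of per-exon rows with one column per sample; the function name is hypothetical, and the R packages do this (plus dispersion estimation and testing) for you.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (exons or genes), each a list of per-sample counts.
    """
    n_samples = len(counts[0])
    # Only rows with nonzero counts in every sample have a defined log geometric mean.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


if __name__ == "__main__":
    table = [[10, 20], [100, 200], [50, 100], [0, 5]]  # sample 2 sequenced twice as deeply
    print(size_factors(table))
```

Dividing each column by its size factor puts the samples on a common scale; using the median makes the estimate robust to a handful of strongly changing exons.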

DE analysis

expression of the same gene across different cells follows a log-normal distribution (qPCR)

variability between different individuals: negative binomial distribution (DESeq / edgeR)

zero inflation is hard to fit with a negative binomial model

tweeDESeq: Poisson-Tweedie family
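
Why a negative binomial rather than a Poisson model: under the common parameterisation var = mu + alpha * mu^2, the dispersion alpha captures the extra variability between individuals. The sketch below is a crude method-of-moments estimate for one gene, given only as an illustration; DESeq and edgeR use more refined estimators that share information across genes, and the function name is hypothetical.

```python
from statistics import mean, variance


def nb_dispersion(counts):
    """Method-of-moments estimate of alpha in var = mu + alpha * mu**2.

    Returns 0.0 when the data are no more variable than a Poisson model predicts.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # unbiased sample variance
    return max((var - mu) / mu ** 2, 0.0)


if __name__ == "__main__":
    print(nb_dispersion([10, 20, 30, 40]))  # overdispersed: variance far above the mean
    print(nb_dispersion([5, 5, 5, 5]))      # no excess variance, so alpha is 0.0
```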

Normalisation

  • RPKM/FPKM
  • TPM
  • TMM
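
RPKM and TPM differ only in the order of the two normalisation steps, which a short sketch makes explicit. The gene names, counts and lengths below are made up for illustration; TMM is not shown because it needs a reference sample and trimming of extreme log ratios.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths[g] * total_reads) for g in counts}


def tpm(counts, lengths):
    """Transcripts per million: length-normalise first, then scale to sum to 1e6."""
    rate = {g: counts[g] / lengths[g] for g in counts}
    scale = sum(rate.values())
    return {g: rate[g] * 1e6 / scale for g in counts}


if __name__ == "__main__":
    counts = {"g1": 100, "g2": 300}     # hypothetical read counts
    lengths = {"g1": 1000, "g2": 3000}  # hypothetical transcript lengths in bp
    print(rpkm(counts, lengths))
    print(tpm(counts, lengths))
```

TPM values always sum to one million within a sample, which makes proportions comparable between samples; RPKM/FPKM values do not have a fixed sum.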

small ncRNAs

miRNA

  • miRDeep2
  • miRanalyzer

miRNA target prediction (SVMs)

complementarity of the first 7-8 nt of the miRNA to the mRNA

thermodynamic stability; position of particular GC/AU matches

  • TargetScan
  • DIANA-microT
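
The complementarity criterion can be sketched as a simple seed search. This is only the first filter such tools apply, not a reimplementation of TargetScan or DIANA-microT. The defaults below take nt 2-8 of the miRNA as the seed, which is an assumption following common convention; pass `start=0` to use the first 7 nt literally. The function name and the toy UTR are hypothetical; the miRNA shown is let-7a.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_sites(mirna, utr, start=1, length=7):
    """Return 0-based positions in utr that are perfectly complementary to the seed.

    mirna, utr : RNA sequences written 5'->3'
    start      : 0-based offset of the seed within the miRNA (1 means nt 2)
    length     : seed length in nucleotides
    """
    seed = mirna[start:start + length]
    # The target site, read 5'->3' on the mRNA, is the reverse complement of the seed.
    site = seed.translate(COMPLEMENT)[::-1]
    positions = []
    pos = utr.find(site)
    while pos != -1:
        positions.append(pos)
        pos = utr.find(site, pos + 1)  # allow overlapping matches
    return positions


if __name__ == "__main__":
    let7a = "UGAGGUAGUAGGUUGUAUAGUU"
    toy_utr = "AAACUACCUCAAAA"  # made-up 3' UTR fragment
    print(seed_sites(let7a, toy_utr))
```

Prediction tools then rank such candidate sites using further features, such as the thermodynamic stability and base-pair position criteria noted above.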

Databases:

  • microRNA.org
  • miRBase
  • piRNABank
  • Rfam
  • miRGator
  • mirWIP
  • TarBase
  • miRTarBase
  • RNAmmer

tRNA prediction

  • tRNAscan-SE
Z. Lu
Data scientist, bioinformatician, retro fan and web lover.