Meta-analysis of RNAseq studies

Meta-analysis of RNAseq studies

The are more and more RNAseq experiments coming out everyday. Data can be reused and meta-analysis can give more information. Here I list commonly used pipelines and tools.

(Title image from ExAtlas)

Raw sequence data

Apart from publications, we can find a list of rescources for genome-wide expression data (see Table 1).

wget / curl -o URL
  1. Base quality

Sequence QC: fastQC, PRINSEQ, Trimmomatic

fastqc READSFILE

Filtering low-quality reads / trimming: Trimmomatic, FastX, PRINSEQ

  1. Ambiguous bases

  2. Adapters: TagCleaner, CutAdapt

  3. Read length

  4. Sequence-specific bias

  5. GC-content

  6. Duplicates (for DGE analysis not recommended to remove duplicates)

  7. Sequence contamination

  8. Low-complexity sequence / polyA tails

Mapping to reference genome

STAR, Bowtie, TopHat, Hisat2

STAR mapping [and counting, STAR counts pair-end reads as one read]

STAR --runThreadN 8 --genomeDir STAR_index_dir --readFilesIn XXX.fastq.gz --readFilesCommand zcat --alignIntronMin 10 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix XXX_ [--sjdbGTFfile REF.gtf --quantMode GeneCounts]

Hisat2 mapping

hisat2 -p 8 hisat2_index_dir -U XXX.fastq -q -S OUTFILE.sam

Mapping stats:

stamtools flagstat XXX.bam/.sam

or to use RseQC

Transcriptome assembly

Mapping-based assembly

Cufflinks and Scripture

de novo assembly

Trinity or Velvet + Oases

Read counts from reference annotation

Reads per gene

HTSeq-count (slow)

htseq-count -f bam -s reverse XXX.bam REF.gtf > XXX.counts.txt

The output of htseq-count is quite simple: gene and read counts

Smp_000020	1434
Smp_000030	6489
Smp_000040	1263
...

and featureCounts (fast).

featureCounts -t exon -g gene_id -a REF.gtf -o XXX.counts.txt FILEIN.sam

gives quite comprehensive output, including gene start, end, lenght, and reads from specific sam file

# Program:featureCounts v1.4.5-p1; Command: YOURCOMMAND
Geneid	Chr	Start	End	Strand	Length	SRR224XXX-v7.sam
Smp_099160      SM_V7_1;SM_V7_1;SM_V7_1;SM_V7_1 2236481;2236791;2238419;2239305 2236757;2236910;2239012;2239533 +;+;+;+ 1220    393
Smp_176700      SM_V7_1;SM_V7_1;SM_V7_1 2775958;2776028;2776342 2775993;2776109;2776574 +;+;+   351     3
Smp_213700      SM_V7_1;SM_V7_1 11237204;11239339       11238276;11240588       +;+     2323    21
...

Reads per transcript

Cufflinks, eXpress

Reads per exon

For differential exon usage: R package - DEXSeq

Data exploration

Correlation between biological samples, PCA/MDS/Sample Distance Matrix.

Data normalisation

  • RPKM/FPKM
  • TPM
  • TMM (edgeR): batch normalisation method
  • DESeq2 (can automatic filter lowly expressed genes)

Output can be normalised counts or rpkm or cpm.

Differential expression

Based on negative binomial distribution

edgeR: glmLRT(fit, contrast = contrast)
DESeq2: results(dds, contrast = (a, b))

Be aware of zero inflation.

Based on Possion-Tweedie

tweeDESeq

Visualisation

Bar/line chart, heatmap, volcano plot, MA plot, Idiogram, gene/transcript structures (R: ggbio)

Interactive visualisation: plotly, d3, highcharts, etc.

Functional

Blast, GO (GOstats, globaltest), Pathway (KEGG), domains (Batch CD-search), protein classification (NCBI SPARCLE)


Available whole pipeline tools:


Meta-analysis examples:

Z. Lu avatar
Z. Lu
Data mining, bioinformatics, parasites, retro, plain text.
comments powered by Disqus