Meta-analysis of RNAseq studies

(Title image from ExAtlas)

Raw sequence data

Apart from publications, we can find a list of rescources for genome-wide expression data (see Table 1).

wget / curl -o URL

Base quality

Sequence QC: fastQC, PRINSEQ, Trimmomatic

fastqc READSFILE

Filtering low-quality reads / trimming: Trimmomatic, FastX, PRINSEQ

Ambiguous bases
Adapters: TagCleaner, CutAdapt
Read length
Sequence-specific bias
GC-content
Duplicates (for DGE analysis not recommended to remove duplicates)
Sequence contamination
Low-complexity sequence / polyA tails

Mapping to reference genome

STAR, Bowtie, TopHat, Hisat2

STAR mapping [and counting, STAR counts pair-end reads as one read]

STAR --runThreadN 8 --genomeDir STAR_index_dir --readFilesIn XXX.fastq.gz --readFilesCommand zcat --alignIntronMin 10 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix XXX_ [--sjdbGTFfile REF.gtf --quantMode GeneCounts]

Hisat2 mapping

hisat2 -p 8 hisat2_index_dir -U XXX.fastq -q -S OUTFILE.sam

Mapping stats:

stamtools flagstat XXX.bam/.sam

or to use RseQC

Transcriptome assembly

Mapping-based assembly

Cufflinks and Scripture

de novo assembly

Trinity or Velvet + Oases

Read counts from reference annotation

Reads per gene

HTSeq-count (slow)

htseq-count -f bam -s reverse XXX.bam REF.gtf > XXX.counts.txt

The output of htseq-count is quite simple: gene and read counts

1
2
3
4


Smp_000020	1434
Smp_000030	6489
Smp_000040	1263
...

and featureCounts (fast).

featureCounts -t exon -g gene_id -a REF.gtf -o XXX.counts.txt FILEIN.sam

gives quite comprehensive output, including gene start, end, lenght, and reads from specific sam file

1
2
3
4
5
6


# Program:featureCounts v1.4.5-p1; Command: YOURCOMMAND
Geneid	Chr	Start	End	Strand	Length	SRR224XXX-v7.sam
Smp_099160      SM_V7_1;SM_V7_1;SM_V7_1;SM_V7_1 2236481;2236791;2238419;2239305 2236757;2236910;2239012;2239533 +;+;+;+ 1220    393
Smp_176700      SM_V7_1;SM_V7_1;SM_V7_1 2775958;2776028;2776342 2775993;2776109;2776574 +;+;+   351     3
Smp_213700      SM_V7_1;SM_V7_1 11237204;11239339       11238276;11240588       +;+     2323    21
...

Reads per transcript

Cufflinks, eXpress

Reads per exon

For differential exon usage: R package - DEXSeq

Data exploration

Correlation between biological samples, PCA/MDS/Sample Distance Matrix.

Data normalisation

RPKM/FPKM
TPM
TMM (edgeR): batch normalisation method
DESeq2 (can automatic filter lowly expressed genes)

Output can be normalised counts or rpkm or cpm.

Differential expression

Based on negative binomial distribution

edgeR: glmLRT(fit, contrast = contrast)
DESeq2: results(dds, contrast = (a, b))

Be aware of zero inflation.

Based on Possion-Tweedie

tweeDESeq

Visualisation

Bar/line chart, heatmap, volcano plot, MA plot, Idiogram, gene/transcript structures (R: ggbio)

Interactive visualisation: plotly, d3, highcharts, etc.

Functional

Blast, GO (GOstats, globaltest), Pathway (KEGG), domains (Batch CD-search), protein classification (NCBI SPARCLE)

Available whole pipeline tools:

Zika-RNAseq-Pipeline (Python + R)
RNA-seq workflow at the gene level

Meta-analysis examples:

Meta-analysis of RNA-seq expression data across species, tissues and studies (in Genome Biology 2015)
Differential meta-analysis of RNA-seq data from multiple studies (in BMC Bioinformatics 2014)
Meta-analysis of gene expression studies in endometrial cancer identifies gene expression profiles associated with aggressive disease and patient outcome (in Sci Rep 2016)