GO enrichment in topGO using KS test

Figure source: topGO manual

The most popular method to test GO enrichment in a list of genes is to use Hypegeometric test (one-side Fisher’s test). This method is based on genes counts for each GO term, but not considering like p-value or other weights. In fact, topGO supports three testing methods (see the manual):

Tests based on gene counts. This is the most popular family of tests, given that it only requires the presence of a list of interesting genes and nothing more. Tests like Fisher’s exact test, Hypegeometric test and binomial test belong to this family. Draghici et al. (2006)

Tests based on gene scores or gene ranks. It includes Kolmogorov-Smirnov like tests (also known as GSEA), Gentleman’s Category, t-test, etc. Ackermann and Strimmer (2009)
Tests based on gene expression. Tests like Goeman’s globaltest or GlobalAncova separates from the others since they work directly on the expression matrix. Goeman and Buhlmann (2007)

For the ks test, we can use gene ranks such as DGE p-values or fold-change values, or other ranking measurements. Nitsuaq here nicely discussed the difference between fisher and ks tests:

Fisher and ks are just two ways of answering the same question: are the most significant genes enriched for any particular GO term annotations? Fisher’s exact test compares the expected number of significant genes at random to the observed number of significant genes to arrive at a probability. The KS test compares the distribution of gene p-values expected at random to the observed distribution of the gene p-values to arrive at a probability. KS is theoretically the better choice because it does not require an arbitrary p-value threshold.

Back to the code part, the main difference from fisher’s test is that there is a gene selection function for geneList in the topGOdata object:

1
2
3
4
5
6
7
8
9


myGOdata <- new("topGOdata",
		description = "GO enrichment",
		ontology = 'BP',
		allGenes = geneList, # specifies all annotated genes
		geneSel = mySel, # the selection function
		annot = annFUN.gene2GO,
		gene2GO = ref.vec,
		nodeSize = 10, # minimal number of genes for a term
)

Here the geneSel function is called mySel. And unfortunately the current selection function only supports a simple score cutoff, rather than complex filtering and self-provided gene list (See the post here). So the gene scores need to be continuous, e.g., p-value cutoff for selecting DEGs.

1
2
3


mySel<-function (score) {
	return(score>=0.85)
}

In Fisher’s test, the geneList is simply a list of gene names, but for the ks test we need to store the gene scores in the toGOdata object, by providing the scores in geneList like this:

1
2
3
4
5


> head(geneList)
Smp_000880 Smp_001830 Smp_002080 Smp_002410 Smp_002600 Smp_002880
     0.766      0.837      0.749      0.752      0.732      0.710
> class(geneList)
[1] "numeric"

This can be done from importing the GO annotation file (“refGO”) with scores:

1
2
3


Smp_000880 0.766 GO:0015078,GO:0015986,GO:0015991,GO:0033177,GO:0045263
Smp_000900 0.704 GO:0004930,GO:0007186,GO:0016021
Smp_001000 0.755 GO:0006950

and assign the score values to geneList and turn it to numeric:

1
2


geneList<-refGO$score
names(geneList) <- refGO$id

Now if we check the topGOdata object the scores are stored there, which are not present in Fisher’s test

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


------------------------- topGOdata object -------------------------

 Description:
   -  topGO

 Ontology:
   -  BP

 291 available genes (all genes from the array):
   - symbol:  Smp_000880 Smp_001830 Smp_002080 Smp_002410 Smp_002600  ...
   - score :  0.766 0.837 0.749 0.752 0.732  ...
   - 51  significant genes.

 139 feasible genes (genes that can be used in the analysis):
   - symbol:  Smp_000880 Smp_002080 Smp_002600 Smp_002880 Smp_005720  ...
   - score :  0.766 0.749 0.732 0.71 0.815  ...
   - 30  significant genes.

Finally we can run the ks test:

resultWeight01KS <- runTest(myGOdata,algorithm="weight01",statistic="ks")

Of course we also would like to put genes for each significant term to GenTable from the significant list (see the complete [topGOdata class](topGodata class: https://www.rdocumentation.org/packages/topGO/versions/2.24.0/topics/topGOdata-class)):

1
2
3
4
5


allRes$genes <- sapply(allRes$GO.ID, function(x)
    {
      genes<-genesInTerm(myGOdata, x) 
      genes[[1]][genes[[1]] %in% sigGenes(myGOdata)] # significant genes
    })

But it seems that the ks test can only identify significant GO terms that are different between the selection and the rest, it won’t tell you which is enrichment in your gene selection (see the post from Fabio Marroni, who showed that the KS p-values and terms are the same for selecting either expressed or non-expressed genes), and the impression also gave that

the fisher test with p<0.01 and weight01 algorithm seemed to identify the informative GO terms, whereas KS and weight01 tended to identify very basic GO terms like biological process, or cellular process. – https://www.biostars.org/p/247636/