Featured image of post Topological sorting of gff features

Topological sorting of gff features

Tried several tools to sort gff topologically, incluing genometools, Perl script GFF3sort, Python package tag and pure shell command.

Figure from tag documentation

I have a gff file (test.gff) containing an example gene with two transcripts:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
SM_V7_1	AUGUSTUS	exon	103403	103770	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	103403	103770	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	gene	103403	151162	0.12	-	.	ID=Smp_315690
SM_V7_1	AUGUSTUS	mRNA	103403	151162	0.02	-	.	ID=Smp_315690.1;Parent=Smp_315690
SM_V7_1	AUGUSTUS	mRNA	103403	151162	0.1	-	.	ID=Smp_315690.2;Parent=Smp_315690
SM_V7_1	AUGUSTUS	three_prime_UTR	103403	103440	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	three_prime_UTR	103403	103440	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	103441	103770	0.93	-	0	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	103441	103770	0.96	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	105920	106144	1	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	105920	106144	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	106876	107159	0.93	-	2	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	106876	107159	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	140582	140849	0.85	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	140582	140849	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	142981	143205	1	-	0	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	142981	143205	1	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	142981	143205	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	142981	143205	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	145395	145678	1	-	2	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	145395	145678	1	-	2	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	145395	145678	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	145395	145678	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	151075	151132	1	-	0	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	151075	151132	1	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	151075	151162	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	151075	151162	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	five_prime_UTR	151133	151162	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	five_prime_UTR	151133	151162	.	-	.	Parent=Smp_315690.2

The gff is basically sorted by chromosome and coordinates via sort -k1,1 -k4,4n

The only problem is that it doesn’t maintain the hierarchical relationship among the features: gene, mRNA, exon/CDS, UTR etc. I know that I can use genometools command gt gff3 -retainids test.gff to get a perfect re-arrangement:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
SM_V7_1	AUGUSTUS	gene	103403	151162	0.12	-	.	ID=Smp_315690
#
SM_V7_1	AUGUSTUS	mRNA	103403	151162	0.02	-	.	ID=Smp_315690.1;Parent=Smp_315690
SM_V7_1	AUGUSTUS	three_prime_UTR	103403	103440	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	103403	103770	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	103441	103770	0.93	-	0	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	142981	143205	1	-	0	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	142981	143205	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	145395	145678	1	-	2	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	145395	145678	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	151075	151132	1	-	0	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	151075	151162	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	five_prime_UTR	151133	151162	.	-	.	Parent=Smp_315690.1
#
SM_V7_1	AUGUSTUS	mRNA	103403	151162	0.1	-	.	ID=Smp_315690.2;Parent=Smp_315690
SM_V7_1	AUGUSTUS	three_prime_UTR	103403	103440	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	103403	103770	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	103441	103770	0.96	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	105920	106144	1	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	105920	106144	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	106876	107159	0.93	-	2	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	106876	107159	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	140582	140849	0.85	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	140582	140849	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	142981	143205	1	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	142981	143205	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	145395	145678	1	-	2	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	145395	145678	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	151075	151132	1	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	151075	151162	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	five_prime_UTR	151133	151162	.	-	.	Parent=Smp_315690.2

But since I couldn’t install genometools on my local computer, I am looking for an alternative to modify the Apollo-exported gff. First I came across the Python library tag, which is easy to install via pip and easy to use tag gff3 test.gff but it doesn’t retain the original IDs:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SM_V7_1	AUGUSTUS	gene	103403	151162	0.120	-	.	ID=gene1
SM_V7_1	AUGUSTUS	mRNA	103403	151162	0.020	-	.	ID=mRNA1;Parent=gene1
SM_V7_1	AUGUSTUS	three_prime_UTR	103403	103440	.	-	.	Parent=mRNA1
SM_V7_1	AUGUSTUS	exon	103403	103770	.	-	.	Parent=mRNA1
SM_V7_1	AUGUSTUS	CDS	103441	103770	0.930	-	0	ID=CDS1;Parent=mRNA1
SM_V7_1	AUGUSTUS	exon	142981	143205	.	-	.	Parent=mRNA1
...
SM_V7_1	AUGUSTUS	mRNA	103403	151162	0.100	-	.	ID=mRNA2;Parent=gene1
SM_V7_1	AUGUSTUS	three_prime_UTR	103403	103440	.	-	.	Parent=mRNA2
SM_V7_1	AUGUSTUS	exon	103403	103770	.	-	.	Parent=mRNA2
SM_V7_1	AUGUSTUS	CDS	103441	103770	0.960	-	0	ID=CDS2;Parent=mRNA2

I had a look at their original code but couldn’t find a solution. Then I found the Perl script GFF3sort, which even has an associated publication (). For this I need to install several modules (Naturally.pm, experimental.pm, and Topological.pm) and I used one of the option to sort perl gff3sort.pl –precise test.gff, which seems kept the parental relationship but failed to devide the two mRNAs:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
SM_V7_1	AUGUSTUS	gene	103403	151162	0.12	-	.	ID=Smp_315690
SM_V7_1	AUGUSTUS	mRNA	103403	151162	0.02	-	.	ID=Smp_315690.1;Parent=Smp_315690
SM_V7_1	AUGUSTUS	mRNA	103403	151162	0.1	-	.	ID=Smp_315690.2;Parent=Smp_315690
SM_V7_1	AUGUSTUS	exon	103403	103770	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	three_prime_UTR	103403	103440	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	three_prime_UTR	103403	103440	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	103403	103770	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	103441	103770	0.93	-	0	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	103441	103770	0.96	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	105920	106144	1	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	105920	106144	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	106876	107159	0.93	-	2	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	106876	107159	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	140582	140849	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	140582	140849	0.85	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	142981	143205	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	142981	143205	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	142981	143205	1	-	0	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	142981	143205	1	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	145395	145678	1	-	2	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	145395	145678	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	145395	145678	1	-	2	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	145395	145678	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	151075	151162	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	151075	151132	1	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	151075	151132	1	-	0	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	151075	151162	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	five_prime_UTR	151133	151162	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	five_prime_UTR	151133	151162	.	-	.	Parent=Smp_315690.2

Actually you can get a smililar result via the shell command through simple replacement:

1
sed 's/gene/AAA/; s/mRNA/BBB/' test.gff | sort -k1,1 -k4,4n | sed 's/AAA/gene/; s/BBB/mRNA/'

Overall sadly there is still no perfect local tool for me that is comparable to genometools.

Update: Jacques has made an excellent comparison using different tools/packages with explanations on how they sort. It looks like AGAT gives the desired order. Thanks Jacques!

comments powered by Disqus
CC-BY-NC 4.0
Built with Hugo Theme Stack