Topological sorting of gff features

Topological sorting of gff features

Tried several tools to sort gff topologically, incluing genometools, Perl script GFF3sort, Python package tag and pure shell command.

Figure from tag documentation

I have a gff file (test.gff) containing an example gene with two transcripts:

~~~~~~ SM_V7_1 AUGUSTUS exon 103403 103770 . - . Parent=Smp_315690.1 SM_V7_1 AUGUSTUS exon 103403 103770 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS gene 103403 151162 0.12 - . ID=Smp_315690 SM_V7_1 AUGUSTUS mRNA 103403 151162 0.02 - . ID=Smp_315690.1;Parent=Smp_315690 SM_V7_1 AUGUSTUS mRNA 103403 151162 0.1 - . ID=Smp_315690.2;Parent=Smp_315690 SM_V7_1 AUGUSTUS three_prime_UTR 103403 103440 . - . Parent=Smp_315690.1 SM_V7_1 AUGUSTUS three_prime_UTR 103403 103440 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 103441 103770 0.93 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1 SM_V7_1 AUGUSTUS CDS 103441 103770 0.96 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 105920 106144 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS exon 105920 106144 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 106876 107159 0.93 - 2 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS exon 106876 107159 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 140582 140849 0.85 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS exon 140582 140849 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 142981 143205 1 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1 SM_V7_1 AUGUSTUS CDS 142981 143205 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS exon 142981 143205 . - . Parent=Smp_315690.1 SM_V7_1 AUGUSTUS exon 142981 143205 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 145395 145678 1 - 2 ID=Smp_315690.1.cds;Parent=Smp_315690.1 SM_V7_1 AUGUSTUS CDS 145395 145678 1 - 2 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS exon 145395 145678 . - . Parent=Smp_315690.1 SM_V7_1 AUGUSTUS exon 145395 145678 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 151075 151132 1 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1 SM_V7_1 AUGUSTUS CDS 151075 151132 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS exon 151075 151162 . - . Parent=Smp_315690.1 SM_V7_1 AUGUSTUS exon 151075 151162 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS five_prime_UTR 151133 151162 . - . Parent=Smp_315690.1 SM_V7_1 AUGUSTUS five_prime_UTR 151133 151162 . - . Parent=Smp_315690.2 ~~~~~~

The gff is basically sorted by chromosome and coordinates via sort -k1,1 -k4,4n

The only problem is that it doesn’t maintain the hierarchical relationship among the features: gene, mRNA, exon/CDS, UTR etc. I know that I can use genometools command gt gff3 -retainids test.gff to get a perfect re-arrangement:

SM_V7_1	AUGUSTUS	gene	103403	151162	0.12	-	.	ID=Smp_315690
#
SM_V7_1	AUGUSTUS	mRNA	103403	151162	0.02	-	.	ID=Smp_315690.1;Parent=Smp_315690
SM_V7_1	AUGUSTUS	three_prime_UTR	103403	103440	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	103403	103770	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	103441	103770	0.93	-	0	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	142981	143205	1	-	0	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	142981	143205	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	145395	145678	1	-	2	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	145395	145678	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	CDS	151075	151132	1	-	0	ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	exon	151075	151162	.	-	.	Parent=Smp_315690.1
SM_V7_1	AUGUSTUS	five_prime_UTR	151133	151162	.	-	.	Parent=Smp_315690.1
#
SM_V7_1	AUGUSTUS	mRNA	103403	151162	0.1	-	.	ID=Smp_315690.2;Parent=Smp_315690
SM_V7_1	AUGUSTUS	three_prime_UTR	103403	103440	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	103403	103770	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	103441	103770	0.96	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	105920	106144	1	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	105920	106144	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	106876	107159	0.93	-	2	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	106876	107159	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	140582	140849	0.85	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	140582	140849	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	142981	143205	1	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	142981	143205	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	145395	145678	1	-	2	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	145395	145678	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	CDS	151075	151132	1	-	0	ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	exon	151075	151162	.	-	.	Parent=Smp_315690.2
SM_V7_1	AUGUSTUS	five_prime_UTR	151133	151162	.	-	.	Parent=Smp_315690.2

But since I couldn’t install genometools on my local computer, I am looking for an alternative to modify the Apollo-exported gff. First I came across the Python library tag, which is easy to install via pip and easy to use tag gff3 test.gff but it doesn’t retain the original IDs:

SM_V7_1	AUGUSTUS	gene	103403	151162	0.120	-	.	ID=gene1
SM_V7_1	AUGUSTUS	mRNA	103403	151162	0.020	-	.	ID=mRNA1;Parent=gene1
SM_V7_1	AUGUSTUS	three_prime_UTR	103403	103440	.	-	.	Parent=mRNA1
SM_V7_1	AUGUSTUS	exon	103403	103770	.	-	.	Parent=mRNA1
SM_V7_1	AUGUSTUS	CDS	103441	103770	0.930	-	0	ID=CDS1;Parent=mRNA1
SM_V7_1	AUGUSTUS	exon	142981	143205	.	-	.	Parent=mRNA1
...
SM_V7_1	AUGUSTUS	mRNA	103403	151162	0.100	-	.	ID=mRNA2;Parent=gene1
SM_V7_1	AUGUSTUS	three_prime_UTR	103403	103440	.	-	.	Parent=mRNA2
SM_V7_1	AUGUSTUS	exon	103403	103770	.	-	.	Parent=mRNA2
SM_V7_1	AUGUSTUS	CDS	103441	103770	0.960	-	0	ID=CDS2;Parent=mRNA2

I had a look at their original code but couldn’t find a solution. Then I found the Perl script GFF3sort, which even has an associated publication (). For this I need to install several modules (Naturally.pm, experimental.pm, and Topological.pm) and I used one of the option to sort perl gff3sort.pl –precise test.gff, which seems kept the parental relationship but failed to devide the two mRNAs:

~~~~~~ SM_V7_1 AUGUSTUS gene 103403 151162 0.12 - . ID=Smp_315690 SM_V7_1 AUGUSTUS mRNA 103403 151162 0.02 - . ID=Smp_315690.1;Parent=Smp_315690 SM_V7_1 AUGUSTUS mRNA 103403 151162 0.1 - . ID=Smp_315690.2;Parent=Smp_315690 SM_V7_1 AUGUSTUS exon 103403 103770 . - . Parent=Smp_315690.1 SM_V7_1 AUGUSTUS three_prime_UTR 103403 103440 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS three_prime_UTR 103403 103440 . - . Parent=Smp_315690.1 SM_V7_1 AUGUSTUS exon 103403 103770 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 103441 103770 0.93 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1 SM_V7_1 AUGUSTUS CDS 103441 103770 0.96 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 105920 106144 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS exon 105920 106144 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 106876 107159 0.93 - 2 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS exon 106876 107159 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS exon 140582 140849 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 140582 140849 0.85 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS exon 142981 143205 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS exon 142981 143205 . - . Parent=Smp_315690.1 SM_V7_1 AUGUSTUS CDS 142981 143205 1 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1 SM_V7_1 AUGUSTUS CDS 142981 143205 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 145395 145678 1 - 2 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS exon 145395 145678 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 145395 145678 1 - 2 ID=Smp_315690.1.cds;Parent=Smp_315690.1 SM_V7_1 AUGUSTUS exon 145395 145678 . - . Parent=Smp_315690.1 SM_V7_1 AUGUSTUS exon 151075 151162 . - . Parent=Smp_315690.1 SM_V7_1 AUGUSTUS CDS 151075 151132 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2 SM_V7_1 AUGUSTUS CDS 151075 151132 1 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1 SM_V7_1 AUGUSTUS exon 151075 151162 . - . Parent=Smp_315690.2 SM_V7_1 AUGUSTUS five_prime_UTR 151133 151162 . - . Parent=Smp_315690.1 SM_V7_1 AUGUSTUS five_prime_UTR 151133 151162 . - . Parent=Smp_315690.2 ~~~~~~

Actually you can get a smililar result via the shell command through simple replacement:

  cat test.gff | sed 's/gene/AAA/; s/mRNA/BBB/'| sort -k1,1 -k4,4n | sed 's/AAA/gene/; s/BBB/mRNA/'

Overall sadly there is still no perfect local tool for me that is comparable to genometools.

Z. Lu avatar
Z. Lu
Data science, bioinfo, scripting, parasites, retro, plain text.
comments powered by Disqus