Figure from tag documentation
I have a gff file (test.gff) containing an example gene with two transcripts:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
|
SM_V7_1 AUGUSTUS exon 103403 103770 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS exon 103403 103770 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS gene 103403 151162 0.12 - . ID=Smp_315690
SM_V7_1 AUGUSTUS mRNA 103403 151162 0.02 - . ID=Smp_315690.1;Parent=Smp_315690
SM_V7_1 AUGUSTUS mRNA 103403 151162 0.1 - . ID=Smp_315690.2;Parent=Smp_315690
SM_V7_1 AUGUSTUS three_prime_UTR 103403 103440 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS three_prime_UTR 103403 103440 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 103441 103770 0.93 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1 AUGUSTUS CDS 103441 103770 0.96 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 105920 106144 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 105920 106144 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 106876 107159 0.93 - 2 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 106876 107159 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 140582 140849 0.85 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 140582 140849 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 142981 143205 1 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1 AUGUSTUS CDS 142981 143205 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 142981 143205 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS exon 142981 143205 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 145395 145678 1 - 2 ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1 AUGUSTUS CDS 145395 145678 1 - 2 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 145395 145678 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS exon 145395 145678 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 151075 151132 1 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1 AUGUSTUS CDS 151075 151132 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 151075 151162 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS exon 151075 151162 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS five_prime_UTR 151133 151162 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS five_prime_UTR 151133 151162 . - . Parent=Smp_315690.2
|
The gff is basically sorted by chromosome and coordinates via sort -k1,1 -k4,4n
The only problem is that it doesn’t maintain the hierarchical relationship among the features: gene, mRNA, exon/CDS, UTR etc. I know that I can use genometools command gt gff3 -retainids test.gff to get a perfect re-arrangement:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
|
SM_V7_1 AUGUSTUS gene 103403 151162 0.12 - . ID=Smp_315690
#
SM_V7_1 AUGUSTUS mRNA 103403 151162 0.02 - . ID=Smp_315690.1;Parent=Smp_315690
SM_V7_1 AUGUSTUS three_prime_UTR 103403 103440 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS exon 103403 103770 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS CDS 103441 103770 0.93 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1 AUGUSTUS CDS 142981 143205 1 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1 AUGUSTUS exon 142981 143205 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS CDS 145395 145678 1 - 2 ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1 AUGUSTUS exon 145395 145678 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS CDS 151075 151132 1 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1 AUGUSTUS exon 151075 151162 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS five_prime_UTR 151133 151162 . - . Parent=Smp_315690.1
#
SM_V7_1 AUGUSTUS mRNA 103403 151162 0.1 - . ID=Smp_315690.2;Parent=Smp_315690
SM_V7_1 AUGUSTUS three_prime_UTR 103403 103440 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 103403 103770 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 103441 103770 0.96 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 105920 106144 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 105920 106144 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 106876 107159 0.93 - 2 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 106876 107159 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 140582 140849 0.85 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 140582 140849 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 142981 143205 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 142981 143205 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 145395 145678 1 - 2 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 145395 145678 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 151075 151132 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 151075 151162 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS five_prime_UTR 151133 151162 . - . Parent=Smp_315690.2
|
But since I couldn’t install genometools on my local computer, I am looking for an alternative to modify the Apollo-exported gff. First I came across the Python library tag, which is easy to install via pip and easy to use tag gff3 test.gff but it doesn’t retain the original IDs:
1
2
3
4
5
6
7
8
9
10
11
|
SM_V7_1 AUGUSTUS gene 103403 151162 0.120 - . ID=gene1
SM_V7_1 AUGUSTUS mRNA 103403 151162 0.020 - . ID=mRNA1;Parent=gene1
SM_V7_1 AUGUSTUS three_prime_UTR 103403 103440 . - . Parent=mRNA1
SM_V7_1 AUGUSTUS exon 103403 103770 . - . Parent=mRNA1
SM_V7_1 AUGUSTUS CDS 103441 103770 0.930 - 0 ID=CDS1;Parent=mRNA1
SM_V7_1 AUGUSTUS exon 142981 143205 . - . Parent=mRNA1
...
SM_V7_1 AUGUSTUS mRNA 103403 151162 0.100 - . ID=mRNA2;Parent=gene1
SM_V7_1 AUGUSTUS three_prime_UTR 103403 103440 . - . Parent=mRNA2
SM_V7_1 AUGUSTUS exon 103403 103770 . - . Parent=mRNA2
SM_V7_1 AUGUSTUS CDS 103441 103770 0.960 - 0 ID=CDS2;Parent=mRNA2
|
I had a look at their original code but couldn’t find a solution. Then I found the Perl script GFF3sort, which even has an associated publication (). For this I need to install several modules (Naturally.pm, experimental.pm, and Topological.pm) and I used one of the option to sort perl gff3sort.pl –precise test.gff, which seems kept the parental relationship but failed to devide the two mRNAs:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
|
SM_V7_1 AUGUSTUS gene 103403 151162 0.12 - . ID=Smp_315690
SM_V7_1 AUGUSTUS mRNA 103403 151162 0.02 - . ID=Smp_315690.1;Parent=Smp_315690
SM_V7_1 AUGUSTUS mRNA 103403 151162 0.1 - . ID=Smp_315690.2;Parent=Smp_315690
SM_V7_1 AUGUSTUS exon 103403 103770 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS three_prime_UTR 103403 103440 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS three_prime_UTR 103403 103440 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS exon 103403 103770 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 103441 103770 0.93 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1 AUGUSTUS CDS 103441 103770 0.96 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 105920 106144 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 105920 106144 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 106876 107159 0.93 - 2 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 106876 107159 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 140582 140849 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 140582 140849 0.85 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 142981 143205 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 142981 143205 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS CDS 142981 143205 1 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1 AUGUSTUS CDS 142981 143205 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 145395 145678 1 - 2 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS exon 145395 145678 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 145395 145678 1 - 2 ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1 AUGUSTUS exon 145395 145678 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS exon 151075 151162 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS CDS 151075 151132 1 - 0 ID=Smp_315690.2.cds;Parent=Smp_315690.2
SM_V7_1 AUGUSTUS CDS 151075 151132 1 - 0 ID=Smp_315690.1.cds;Parent=Smp_315690.1
SM_V7_1 AUGUSTUS exon 151075 151162 . - . Parent=Smp_315690.2
SM_V7_1 AUGUSTUS five_prime_UTR 151133 151162 . - . Parent=Smp_315690.1
SM_V7_1 AUGUSTUS five_prime_UTR 151133 151162 . - . Parent=Smp_315690.2
|
Actually you can get a smililar result via the shell command through simple replacement:
1
|
sed 's/gene/AAA/; s/mRNA/BBB/' test.gff | sort -k1,1 -k4,4n | sed 's/AAA/gene/; s/BBB/mRNA/'
|
Overall sadly there is still no perfect local tool for me that is comparable to genometools.
Update:
Jacques has made an excellent comparison using different tools/packages with explanations on how they sort. It looks like AGAT gives the desired order. Thanks Jacques!