Get fasta sequences for features in a gff file using Python

Previous I have been using a Perl Script to extract aa and dna sequences from a gff file, but there were flaws in that script, which requires extra attention (e.g. header in the gff file; order of features; cannot get sequence of the last gene).

Since I have learned Python for a while, I am thinking to do it using Python to avoid possible mistakes of the Perl script. I know BioPython is good at handling sequences so there is no need to write the whole function from scratch. After all, this is what I found about parsing gff files (a bit suprised that it’s not been integrated into Biopython yet) using gffutils. Here are the steps:

Create a database for the gff file (flexible to query any feature)

1
2
3
4
5


import gffutils
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
myGFF = YOUR_GFF_FILE
db = gffutils.create_db(myGFF, ':memory:', merge_strategy="create_unique", keep_order=True)

This creates a database file in the memory, you can also create a real sqlite3 db file.

Once created, you can load the database without creating it again

db = gffutils.FeatureDB('myGFF.db')

Because ‘gene’ and ‘transcript/mRNA’ are in kind of top level, and ‘exon’ ‘CDS’ are in child level (you need to combined multiple sequences), I wrote two functions for each:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


myFasta = FASTA_GENOME_SEQUENCE

## function to get all gene/transcript(mRNA) sequences
def parent_seq(type):
    for p in db.features_of_type(type):
        p_seq = p.sequence(myFasta)
        p_seq = Seq(p_seq, generic_dna)
        if p.strand == '-':
            p_seq = p_seq.reverse_complement()
        print('>' + p.id + '\n' + p_seq)

## function to get all cds/exon/intron sequences under the same transcript id
def child_seq(type):
    for t in db.features_of_type('transcript', order_by='start'): # or mRNA depending on the gff
        # print(t.sequence(myFasta))
        seq_combined = ''
        for i in db.children(t, featuretype=type, order_by='start'): # or exon/intron
            seq = i.sequence(myFasta, use_strand=True)  # use_strand doesn't work; have to revcomp
            seq_combined += seq
        seq_combined = Seq(seq_combined, generic_dna)
        if t.strand == '-':
            seq_combined = seq_combined.reverse_complement()
        print('>' + t.id + '\n' + seq_combined)

Note that in the Feature.sequence function, the option ‘use_strand = True’ (default) is still not working, and you need to reverse complement the sequence manually.

The next option is to translate all CDS sequences into AA sequence, which uses the ‘.translate()’ function (I worte another function here):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


## function to get all protein sequences under the same transcript id
def pep_seq():
    for t in db.features_of_type('transcript', order_by='start'): # or mRNA depending on the gff
        # print(t.sequence(myFasta))
	print('>' + t.id)
        seq_combined = ''
        for i in db.children(t, featuretype='CDS', order_by='start'): # or exon/intron
            seq = i.sequence(myFasta, use_strand=True)  # use_strand doesn't work; have to revcomp
            seq_combined += seq
        seq_combined = Seq(seq_combined, generic_dna)
        if t.strand == '-':
            seq_combined = seq_combined.reverse_complement()
        seq_transl = seq_combined.translate()
        for i in range(0, len(seq_transl), 60): # print 60 bases per line
            print(seq_transl[i:i+60])

When you run this you may get the warning message into the output:

BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.

To avoid the message in the output:

1
2
3
4


## ignore biopython warnings
import warnings
from Bio import BiopythonWarning
warnings.simplefilter('ignore', BiopythonWarning)

The next thing will be to extract upstream sequences of TSS.

Check the complete script on my Github.