Use Python (BioPython and gffutils) to extract sequences for gene features.
Previous I have been using a Perl Script to extract aa and dna sequences from a gff file, but there were flaws in that script, which requires extra attention (e.g. header in the gff file; order of features; cannot get sequence of the last gene).
Since I have learned Python for a while, I am thinking to do it using Python to avoid possible mistakes of the Perl script. I know BioPython is good at handling sequences so there is no need to write the whole function from scratch. After all, this is what I found about parsing gff files (a bit suprised that it’s not been integrated into Biopython yet) using gffutils. Here are the steps:
Create a database for the gff file (flexible to query any feature)
This creates a database file in the memory, you can also create a real sqlite3 db file.
Once created, you can load the database without creating it again
db = gffutils.FeatureDB('myGFF.db')
Because ‘gene’ and ‘transcript/mRNA’ are in kind of top level, and ‘exon’ ‘CDS’ are in child level (you need to combined multiple sequences), I wrote two functions for each:
myFasta=FASTA_GENOME_SEQUENCE## function to get all gene/transcript(mRNA) sequencesdefparent_seq(type):forpindb.features_of_type(type):p_seq=p.sequence(myFasta)p_seq=Seq(p_seq,generic_dna)ifp.strand=='-':p_seq=p_seq.reverse_complement()print('>'+p.id+'\n'+p_seq)## function to get all cds/exon/intron sequences under the same transcript iddefchild_seq(type):fortindb.features_of_type('transcript',order_by='start'):# or mRNA depending on the gff# print(t.sequence(myFasta))seq_combined=''foriindb.children(t,featuretype=type,order_by='start'):# or exon/intronseq=i.sequence(myFasta,use_strand=True)# use_strand doesn't work; have to revcompseq_combined+=seqseq_combined=Seq(seq_combined,generic_dna)ift.strand=='-':seq_combined=seq_combined.reverse_complement()print('>'+t.id+'\n'+seq_combined)
Note that in the Feature.sequence function, the option ‘use_strand = True’ (default) is still not working, and you need to reverse complement the sequence manually.
The next option is to translate all CDS sequences into AA sequence, which uses the ‘.translate()’ function (I worte another function here):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
## function to get all protein sequences under the same transcript iddefpep_seq():fortindb.features_of_type('transcript',order_by='start'):# or mRNA depending on the gff# print(t.sequence(myFasta))print('>'+t.id)seq_combined=''foriindb.children(t,featuretype='CDS',order_by='start'):# or exon/intronseq=i.sequence(myFasta,use_strand=True)# use_strand doesn't work; have to revcompseq_combined+=seqseq_combined=Seq(seq_combined,generic_dna)ift.strand=='-':seq_combined=seq_combined.reverse_complement()seq_transl=seq_combined.translate()foriinrange(0,len(seq_transl),60):# print 60 bases per lineprint(seq_transl[i:i+60])
When you run this you may get the warning message into the output:
BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.