Use Python (BioPython and gffutils) to extract sequences for gene features.
Previous I have been using a Perl Script to extract aa and dna sequences from a gff file, but there were flaws in that script, which requires extra attention (e.g. header in the gff file; order of features; cannot get sequence of the last gene).
Since I have learned Python for a while, I am thinking to do it using Python to avoid possible mistakes of the Perl script. I know BioPython is good at handling sequences so there is no need to write the whole function from scratch. After all, this is what I found about parsing gff files (a bit suprised that it’s not been integrated into Biopython yet) using gffutils. Here are the steps:
Create a database for the gff file (flexible to query any feature)
import gffutils from Bio.Seq import Seq from Bio.Alphabet import generic_dna myGFF = YOUR_GFF_FILE db = gffutils.create_db(myGFF, ':memory:', merge_strategy="create_unique", keep_order=True)
This creates a database file in the memory, you can also create a real sqlite3 db file.
Once created, you can load the database without creating it again
db = gffutils.FeatureDB('myGFF.db')
Because ‘gene’ and ‘transcript/mRNA’ are in kind of top level, and ‘exon’ ‘CDS’ are in child level (you need to combined multiple sequences), I wrote two functions for each:
myFasta = FASTA_GENOME_SEQUENCE ## function to get all gene/transcript(mRNA) sequences def parent_seq(type): for p in db.features_of_type(type): p_seq = p.sequence(myFasta) p_seq = Seq(p_seq, generic_dna) if p.strand == '-': p_seq = p_seq.reverse_complement() print('>' + p.id + '\n' + p_seq) ## function to get all cds/exon/intron sequences under the same transcript id def child_seq(type): for t in db.features_of_type('transcript', order_by='start'): # or mRNA depending on the gff # print(t.sequence(myFasta)) seq_combined = '' for i in db.children(t, featuretype=type, order_by='start'): # or exon/intron seq = i.sequence(myFasta, use_strand=True) # use_strand doesn't work; have to revcomp seq_combined += seq seq_combined = Seq(seq_combined, generic_dna) if t.strand == '-': seq_combined = seq_combined.reverse_complement() print('>' + t.id + '\n' + seq_combined)
Note that in the Feature.sequence function, the option ‘use_strand = True’ (default) is still not working, and you need to reverse complement the sequence manually.
The next option is to translate all CDS sequences into AA sequence, which uses the ‘.translate()’ function (I worte another function here):
## function to get all protein sequences under the same transcript id def pep_seq(): for t in db.features_of_type('transcript', order_by='start'): # or mRNA depending on the gff # print(t.sequence(myFasta)) print('>' + t.id) seq_combined = '' for i in db.children(t, featuretype='CDS', order_by='start'): # or exon/intron seq = i.sequence(myFasta, use_strand=True) # use_strand doesn't work; have to revcomp seq_combined += seq seq_combined = Seq(seq_combined, generic_dna) if t.strand == '-': seq_combined = seq_combined.reverse_complement() seq_transl = seq_combined.translate() for i in range(0, len(seq_transl), 60): # print 60 bases per line print(seq_transl[i:i+60])
When you run this you may get the warning message into the output:
BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
To avoid the message in the output:
## ignore biopython warnings import warnings from Bio import BiopythonWarning warnings.simplefilter('ignore', BiopythonWarning)
The next thing will be to extract upstream sequences of TSS.
Check the complete script on my Github.