Perl script: GFF to FASTA

Using Perl to extract sequences based on a GFF annotation.

I was trying to extract the amino acid (aa) sequences of my genes from the annotations in a GFF file and the whole-genome sequence. I came across a BioStar discussion in which people provided a Perl script. I tried it, and it worked nearly perfectly for the Augustus annotation GFF, but not for the GFF converted from a RATT EMBL file. It turned out that the script expects the GFF file to follow some structural rules:

  • As mentioned in the thread, the script cannot extract the sequence of the last gene record. You need to get it manually by swapping the last record with any other one.
  • For each gene record, the features should be in a specific order: gene - mRNA/transcript - exon - CDS. Otherwise you might get a messy sequence.
  • Sometimes you need to have the header “##gff-version 3” at the top of the file.
  • Pay attention to the “;Parent=” structure in column 9, which links each child feature to its parent.
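
The rules above can be checked before running the extraction. The following is a minimal sketch in Python, not the Perl script from the BioStar thread; the function name `check_gff` and the warning texts are made up for illustration. It assumes tab-separated GFF3 lines and interprets the ordering rule as "no child feature before its gene, and no exon after a CDS within the same transcript", which may be stricter or looser than what the Perl script actually requires.

```python
import re

# Rank of each feature type in the order described above:
# gene -> mRNA/transcript -> exon -> CDS.
FEATURE_RANK = {"gene": 0, "mRNA": 1, "transcript": 1, "exon": 2, "CDS": 3}


def check_gff(lines):
    """Return a list of warnings for an iterable of GFF lines.

    Hypothetical helper for illustration; it only checks the header,
    the Parent= attribute, and the feature order.
    """
    warnings = []
    lines = [line.rstrip("\n") for line in lines]

    # Rule: the file should start with the GFF3 version header.
    if not lines or not lines[0].startswith("##gff-version 3"):
        warnings.append("missing '##gff-version 3' header on line 1")

    last_rank = None  # rank of the previous gene-model feature, if any
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.split("\t")
        if len(columns) != 9:
            warnings.append(f"line {number}: expected 9 tab-separated columns")
            continue
        feature, attributes = columns[2], columns[8]
        rank = FEATURE_RANK.get(feature)
        if rank is None:
            continue  # other feature types are ignored in this sketch
        if rank == 0:
            last_rank = 0  # a gene line starts a new record
            continue

        # Rule: child features carry a Parent= attribute in column 9.
        if not re.search(r"(^|;)Parent=[^;]+", attributes):
            warnings.append(f"line {number}: {feature} has no Parent= attribute")

        # Rule: gene - mRNA/transcript - exon - CDS order.
        if last_rank is None:
            warnings.append(f"line {number}: {feature} appears before any gene")
        elif rank >= 2 and rank < last_rank:
            warnings.append(f"line {number}: {feature} appears after a CDS")
        last_rank = rank

    return warnings
```

To try it on a file, something like `print("\n".join(check_gff(open("annotation.gff"))))` would list the warnings, where `annotation.gff` is a placeholder file name. An empty list only means these three checks passed, not that the file is valid GFF3.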
Z. Lu