Get the upstream sequence of gene/TSS

Looking for a method to get the upstream sequence (e.g. promoter) of a gene / TSS.

After searching the web, here are a few methods:

samtools

samtools faidx [FASTA file] [Contig]:[Start]-[End]

e.g., samtools faidx ~/Smansoni/V7/Smansoni_v7.fa SM_V7_1:1-10

Similar tool in Python is pyfaidx or pyfasta

Using BioMart

Python script

There is a Python script available for this purpose by peterthorpe5: https://github.com/peterthorpe5/public_scripts/blob/master/genomic_upstream_regions/get_upstream_regions.py

We can also write a Python script using the Seq module and the gffutils + pyfasta package. This will need a gff3 file and a genome fasta file. We can extract upstream regions either based on gene or mRNA feature.

Here is the core part:

def upstream_seq(type):
    for p in db.features_of_type(type):
        genomefa = Fasta(myFasta)
        print('>' + p.id + "_Upstream" + str(myRange))
        upstart = p.start - 1 - myRange
        upend = p.start -1
        # get sequence based on coordinates (start is 0-based)
        p_upstream = genomefa[p.seqid][upstart:upend]
        if p.strand == '-':
            upstart = p.end
            upend = p.end + myRange
            p_upstream = genomefa[p.seqid][upstart:upend]
            p_upstream = Seq(p_upstream).reverse_complement()
        for i in range(0, len(p_upstream), 60): # print 60 bases per line
            print(p_upstream[i:i+60])

Full script is available on Github.

Z. Lu avatar
Z. Lu
Computer biologist, amature photographer, vintage fan and web lover.
comments powered by Disqus