Replace text based on string pairs using Python

In practice this is used to replace all IDs in a GFF file with your desired IDs, e.g. g10 to Smp_300010 but also g10.t1 to Smp_300010.1 (without 't').

This script does the same task as the one in a previous post (Perl based), but in Python.

Briefly, you have a GFF file and would like to replace its gene IDs (such as g10, g100) with the paired IDs from another file (such as Smp_300010, Smp_300100). Besides replacing whole strings, you would also like to change IDs such as g10.t1 to Smp_300010.1 (removing the 't' in the middle).

Here is a Python solution (given IDPAIRS, GFF, and FILEOUT as arguments):

# usage: python <script.py> <idpair> <gff> <output>

import sys
import re

IDPAIRS = sys.argv[1] # tab separated
GFF = sys.argv[2] # gff file
FILEOUT = sys.argv[3] # output file

d = {} # make a dictionary for raw id pairs
dt = {} # make another dictionary with keys + '.t'
with open(IDPAIRS) as f1:
    for line1 in f1:
        line1 = line1.rstrip()
        (aug, smp) = line1.split("\t")
        d[aug] = smp
        dt[(aug + '.t')] = smp

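# Compile both patterns once, before reading the GFF file.
# re.escape() keeps the '.' in keys such as 'g10.t' literal.
pat1 = re.compile(r'\b(' + '|'.join(map(re.escape, dt)) + r')') # \b(key1\.t|key2\.t|key...)
pat2 = re.compile(r'\b(' + '|'.join(map(re.escape, d)) + r')\b') # \b(key1|key2|key...)\b

with open(GFF) as f2:
    with open(FILEOUT, 'w') as fo:
        for line2 in f2:
            line2 = line2.rstrip()
            # x.group() is the matched key, so it can be looked up directly
            s1 = pat1.sub(lambda x: dt[x.group()] + '.', line2) # replace 'g1.t1' with 'Smp_1.1'
            s2 = pat2.sub(lambda x: d[x.group()], s1)   # replace 'g1' with 'Smp_1'
            fo.write(s2 + "\n")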
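
To check the two replacement rules without any files, here is a minimal sketch that applies them to a single string. It is a variation on the script, not the script itself: it uses one dictionary plus a lookahead for the '.t' case, and the names build_replacer, replace_ids and the sample GFF line are made up for illustration.

```python
import re

def build_replacer(pairs):
    """Return a function that rewrites IDs in one line of text.

    `pairs` maps old gene IDs to new ones, e.g. {'g10': 'Smp_300010'}.
    """
    alternation = '|'.join(re.escape(k) for k in pairs)
    # 'g10.t' directly followed by a digit -> 'Smp_300010.'
    pat_t = re.compile(r'\b(' + alternation + r')\.t(?=\d)')
    # a bare 'g10' as a whole word -> 'Smp_300010'
    pat_g = re.compile(r'\b(' + alternation + r')\b')

    def replace_ids(line):
        line = pat_t.sub(lambda m: pairs[m.group(1)] + '.', line)
        return pat_g.sub(lambda m: pairs[m.group(1)], line)

    return replace_ids

# ID pairs taken from the examples in this post
pairs = {'g10': 'Smp_300010', 'g100': 'Smp_300100'}
replace_ids = build_replacer(pairs)

# hypothetical GFF line, for illustration only
line = "chr1\tAUGUSTUS\tmRNA\t1\t900\t.\t+\t.\tID=g10.t1;Parent=g10"
print(replace_ids(line))
```

The word boundaries matter here: an ID such as g1000 that is not in the pairs file is left alone, rather than being partly rewritten because it starts with g10 or g100.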
Z. Lu
Data scientist, bioinformatician, retro fan and web lover.