Replace text based on string pairs using Python

This script does the same job as the Perl script from a previous post, this time in Python.

Briefly, you have a GFF file and want to replace the gene IDs (such as g10, g100) with the paired IDs from another file (such as Smp_300010, Smp_300100). Besides replacing whole IDs, you also want to change IDs such as g10.t1 to Smp_300010.1 (dropping the 't' in the middle).

Here is a Python solution (it takes IDPAIRS, GFF, and FILEOUT as arguments):

# usage: python Pair_Replace.py <idpair> <gff> <output>

import sys
import re

IDPAIRS = sys.argv[1] # tab separated
GFF = sys.argv[2] # gff file
FILEOUT = sys.argv[3] # output file

d = {} # make a dictionary for raw id pairs
dt = {} # make another dictionary with keys + '.t'
with open(IDPAIRS) as f1:
    for line1 in f1:
        line1 = line1.rstrip()
        (aug, smp) = line1.split("\t")
        d[aug] = smp
        dt[(aug + '.t')] = smp

# Compile both patterns once, outside the loop; re.escape keeps the '.' in the keys literal
pat1 = re.compile(r'\b(' + '|'.join(re.escape(k) for k in dt) + r')') # \b(key1\.t|key2\.t|key...)
pat2 = re.compile(r'\b(' + '|'.join(re.escape(k) for k in d) + r')\b') # \b(key1|key2|key...)\b

# Both files are closed automatically when the with block ends
with open(GFF) as f2, open(FILEOUT, 'w') as fo:
    for line2 in f2:
        line2 = line2.rstrip()
        s1 = pat1.sub(lambda x: dt[x.group()] + '.', line2) # replace 'g1.t1' with 'Smp_1.1'
        s2 = pat2.sub(lambda x: d[x.group()], s1)   # replace 'g1' with 'Smp_1'
        fo.write(s2 + "\n")
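
To check the two-pass replacement without touching any files, the same idea can be wrapped in a small function and run on a few strings. This is only a sketch: the names `build_patterns` and `replace_ids` and the sample attribute strings are made up for illustration and are not part of the script above. It shows that `g10` and `g100` do not interfere with each other and that an unlisted ID such as `g1000` is left alone.

```python
import re

def build_patterns(pairs):
    """Return (with_t, whole) regex patterns for a dict of old ID -> new ID."""
    # Escape each key so any regex metacharacters in an ID are taken literally
    keys = "|".join(re.escape(k) for k in pairs)
    with_t = re.compile(r"\b(" + keys + r")\.t")  # matches 'g10.t' in 'g10.t1'
    whole = re.compile(r"\b(" + keys + r")\b")    # matches 'g10' as a whole word
    return with_t, whole

def replace_ids(line, pairs, patterns):
    """Replace 'gN.tM' with 'NEW.M' first, then any remaining whole 'gN' with 'NEW'."""
    with_t, whole = patterns
    line = with_t.sub(lambda m: pairs[m.group(1)] + ".", line)
    return whole.sub(lambda m: pairs[m.group(1)], line)

# Hypothetical ID pairs, mirroring the examples in the text
pairs = {"g10": "Smp_300010", "g100": "Smp_300100"}
patterns = build_patterns(pairs)

for attrs in ("ID=g10.t1;Parent=g10", "ID=g100.t1;Parent=g100", "ID=g1000"):
    print(attrs, "->", replace_ids(attrs, pairs, patterns))
```

The `\b` word boundaries are what keep `g10` from matching inside `g100` or `g1000`. Running the `.t` pass before the whole-ID pass means the suffixed IDs are already rewritten by the time the second pattern runs.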