This script is for the same task as in a previous post (Perl based), but in Python.
Briefly, you have a GFF file and would like to replace the gene IDs (such as g10, g100) with the paired IDs from another file (such as Smp_300010, Smp_300100). Besides the whole-string replacement, you would also like to change IDs such as g10.t1 to Smp_300010.1 (removing the 't' in the middle).
I just worked out a Python solution (taking IDPAIRS, GFF, and FILEOUT as arguments):
# usage: python Pair_Replace.py <idpair> <gff> <output>
import sys
import re

IDPAIRS = sys.argv[1]  # tab-separated ID pairs: old ID, new ID
GFF = sys.argv[2]      # GFF file
FILEOUT = sys.argv[3]  # output file

d = {}   # raw ID pairs, e.g. 'g10' -> 'Smp_300010'
dt = {}  # same pairs with '.t' appended to the key, e.g. 'g10.t' -> 'Smp_300010'
with open(IDPAIRS) as f1:
    for line1 in f1:
        line1 = line1.rstrip()
        if not line1:  # skip blank lines
            continue
        (aug, smp) = line1.split("\t")
        d[aug] = smp
        dt[aug + '.t'] = smp

# Compile both patterns once, outside the loop.
# re.escape makes the '.' in keys such as 'g10.t' match a literal dot.
pat1 = re.compile(r'\b(' + '|'.join(map(re.escape, dt)) + r')')   # \b(key1\.t|key2\.t|...)
pat2 = re.compile(r'\b(' + '|'.join(map(re.escape, d)) + r')\b')  # \b(key1|key2|...)\b

# The with block closes both files on exit, so no explicit close() is needed.
with open(GFF) as f2, open(FILEOUT, 'w') as fo:
    for line2 in f2:
        line2 = line2.rstrip("\n")
        s1 = pat1.sub(lambda x: dt[x.group()] + '.', line2)  # replace 'g1.t1' with 'Smp_1.1'
        s2 = pat2.sub(lambda x: d[x.group()], s1)            # replace 'g1' with 'Smp_1'
        fo.write(s2 + "\n")
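As a quick sanity check, the same two-pass substitution can be tried on in-memory strings instead of files. This is only a sketch: the function name `replace_ids`, the sample ID pairs, and the sample GFF line are made up for illustration, and the keys are passed through `re.escape` so that the `.` in `g10.t` is matched literally.

```python
import re

def replace_ids(line, pairs):
    """Replace old gene IDs in `line` using the {old: new} mapping `pairs`."""
    # Same mapping with '.t' appended to each key, for transcript IDs like g10.t1
    pairs_t = {old + ".t": new for old, new in pairs.items()}
    pat_t = re.compile(r"\b(" + "|".join(map(re.escape, pairs_t)) + r")")
    pat = re.compile(r"\b(" + "|".join(map(re.escape, pairs)) + r")\b")
    # Pass 1: 'g10.t1' -> 'Smp_300010.1' (the transcript number is left untouched)
    line = pat_t.sub(lambda m: pairs_t[m.group()] + ".", line)
    # Pass 2: whole-ID matches only, 'g10' -> 'Smp_300010'
    return pat.sub(lambda m: pairs[m.group()], line)

# Hypothetical ID pairs and GFF line, for illustration only
pairs = {"g10": "Smp_300010", "g100": "Smp_300100"}
line = "chr1\tAUGUSTUS\tmRNA\t1\t900\t.\t+\t.\tID=g10.t1;Parent=g10"
print(replace_ids(line, pairs))
```

With these sample values, the attribute column becomes `ID=Smp_300010.1;Parent=Smp_300010`. Because of the trailing `\b` in the second pattern, an ID that is not in the pairs file (say g1000) is left alone and is not partially rewritten as a g10 or g100 match.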