Perl script: text replacement based on string pairs

I’ve recently been doing identifier transfer work for the parasitic worm Schistosoma mansoni. We have two gene sets: the RATT transferred set from the old annotations, and the Augustus set with predicted annotations. The agreement is to keep the features in Augustus, but take the identifiers from RATT.

One approach I considered is to first generate a list of confident identifier pairs (key-value pairs, one pair per line), like this:

g1	Smp_000000
g10	Smp_123456

Then use text replacement to swap the Augustus IDs for their RATT counterparts, taking care not to replace a string (like g1) that is part of a longer string (like g10). We also wanted transcripts to be named “xxx.1” rather than “xxx.t1”.
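To make the pitfall concrete, here is a minimal sketch of the anchoring idea: replace an ID only when it is followed by ".t" or sits at the end of the line. The helper name replace_id and the sample lines are made up for illustration, and the leading \b is an extra guard I am assuming is harmless for IDs of this shape.

```perl
use strict;
use warnings;

# Hypothetical helper: replace one Augustus ID with its RATT counterpart.
sub replace_id {
    my ($line, $old, $new) = @_;
    $line =~ s/\b\Q$old\E\.t/$new./g;   # g1.t1 -> Smp_000000.1 (drops the "t")
    $line =~ s/\b\Q$old\E$/$new/;       # g1 at line end -> Smp_000000
    return $line;
}

# A naive, unanchored substitution corrupts the longer ID:
(my $naive = "Parent=g10") =~ s/g1/Smp_000000/;
print "$naive\n";   # prints: Parent=Smp_0000000

# The anchored version replaces g1 but leaves g10 alone:
print replace_id("ID=g1.t1;Parent=g1",   "g1", "Smp_000000"), "\n";
# prints: ID=Smp_000000.1;Parent=Smp_000000
print replace_id("ID=g10.t1;Parent=g10", "g1", "Smp_000000"), "\n";
# prints: ID=g10.t1;Parent=g10
```

\Q...\E quotes the ID so that any regex metacharacters in it are matched literally.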

I tried to use a Bash associative array, as I did for pairwise sequence alignment, but without success. Then I realized that Perl has hashes and Python has dictionaries, and that a suitable script might already exist. In the end I arrived at the script below, with a few modifications (sadly, I can no longer find the original post):

#!/usr/bin/perl

# usage: perl pair_replace.pl [Aug-RATT id pairs] [INPUT FILE] > OUTPUTFILE

use strict;
use warnings;

my %hsh = ();

open(my $pairs, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
open(my $input, '<', $ARGV[1]) or die "Cannot open $ARGV[1]: $!";

# load the pairs: column 1 = Augustus ID, column 2 = RATT ID
while (<$pairs>) {
    my @arr = split /\s+/;
    next unless @arr >= 2;    # skip blank or incomplete lines
    $hsh{$arr[0]} = $arr[1];
}
close($pairs);

while (my $line = <$input>) {
    foreach my $key (keys %hsh) {
        # find lines with the string at line end or with string + ".t"
        if ($line =~ /\Q$key\E($|\.t)/) {
            $line =~ s/\Q$key\E$/$hsh{$key}/g;        # replace string at line end with the paired string
            $line =~ s/\Q$key\E\.t/$hsh{$key}\./g;    # replace string + ".t" with paired string + "."
        }
    }
    print $line;    # print every line exactly once, replaced or not
}
close($input);

Bingo!

I made a first try and it took 11 hours to replace 13,300 string pairs.
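Most of that time probably goes into testing every key against every line. A possible speed-up, sketched below and not timed by me, is to find each ID in a line with a single regex and look it up in the hash. It assumes the Augustus IDs always look like "g" followed by digits; the helper name replace_ids is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite every Augustus-style ID in a line via hash lookup.
# Assumes IDs have the form g<digits>; IDs missing from the map are left unchanged.
sub replace_ids {
    my ($line, $map) = @_;
    # match an ID followed by ".t" or sitting at the end of the line
    $line =~ s{\b(g\d+)(\.t|$)}{
        exists $map->{$1} ? $map->{$1} . ($2 eq '.t' ? '.' : '') : $1 . $2
    }ge;
    return $line;
}

my %pairs = (g1 => 'Smp_000000', g10 => 'Smp_123456');
print replace_ids("ID=g10.t1;Parent=g10\n", \%pairs);
# prints: ID=Smp_123456.1;Parent=Smp_123456
print replace_ids("ID=g99.t1;Parent=g99\n", \%pairs);
# prints: ID=g99.t1;Parent=g99
```

In the script above, the inner foreach loop would then shrink to a single call, print replace_ids($line, \%hsh), so the work per line no longer grows with the number of pairs. Because \d+ is greedy, g10 is always read as a whole ID and never as g1 plus a trailing 0.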

UPDATE 26.05.2017: The script was changed to make replacements exact (so that “g1” inside “g10” is not replaced).