Sequencing comparison using ACT

To use ACT we will need three essential files: sequence 1, sequence 2, and the comparison file (eg blast output). Additionally we can load annotation file to show features in the sequences (eg. gff or embl etc).

I my case I wanted to compare a gene or part of scaffold with a haplotype contig

gene to contig

I will need to

extract gene sequence from gff
compare gene sequence with contig sequence, ie. blastn
modify the gene annotation to fit it’s sequence coordinate (eg. starting from 1)
load gene seq (sequence 1), blastn output (comparison file), and contig seq (sequence 2) to ACT
load gff annotations to sequence 1 and 2 to disply features

I have a bash script to do this (file_act_gene.sh). To load gff quicker the gff needs bgzip and tabix

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


#!/bin/bash

## usage: ./file_act_gene.sh [gene id] [contig No]

### get gene of target
grep $1 Sm_v7.2_Basic.gff > $1.gff
python ~/Google\ Drive/CodingPractice/py/gff2fasta.py $1.gff Smansoni_v7.fa gene > $1.fa

### blast gene with contig
blastn -query $1.fa -subject SM_V7_${2}.fa -outfmt 6 > $1-$2.blast.out

### annotation for contig
grep SM_V7_$2 Sm_v7.2_Basic.gff | sort -k1,1 -k4,4n > $2.gff
bgzip $2.gff
tabix $2.gff.gz

### annotation with relative coordinates
START="$(grep gene $1.gff|awk '{print $4}')"
cat $1.gff | awk '{print $1,$2,$3,$4-"'$START'"+1, $5-"'$START'"+1,$6,$7,$8,$9}'|tr ' ' '\t'> $1_rel.gff
rm -f $1.gff

### load into ACT: $1.fa, SM_V7_${2}.fa, $1-$2.blast.out
### read after: $1_rel.gff $2.gff.gz

part of a scaffold to a contig

I need to supply the start and end coordinates of a target scaffold. To extract the sequence I used my colleague Alan’s python script:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24


#!/bin/bash

## usage: ./file_act_coord.sh [scaffold No] [start Coord] [end Coord] [haplo contig No]

### get sequence of range
python alan_extract_fa.py Smansoni_v7.fa SM_V7_$1 $2 $3 > SM_V7_$1_$2-$3.fa

### blast fraction to contig
blastn -query SM_V7_$1_$2-$3.fa -subject SM_V7_${4}.fa -outfmt 6 > SM_V7_$1_$2-$3_$4.blast.out

### annotation for contig
grep SM_V7_$4 Sm_v7.2_Basic.gff | sort -k1,1 -k4,4n > $4.gff
bgzip $4.gff
tabix $4.gff.gz

### annotation for selected region 
#### pay attention to usage of variable in awk
declare -i START=$2
declare -i END=$3

cat Sm_v7.2_Basic.gff | awk '$1=="SM_V7_""'$1'"' | awk '$4>='$START' && $5<='$END'' | awk '{print "SM_V7_""'$1'""_""'$2'""-""'$3'",$2,$3,$4-'$START'+1, $5-'$START'+1,$6,$7,$8,$9}'|awk '$4>0'|tr ' ' '\t' |sort -k1,1 -k4,4n> SM_V7_$1_$2-$3_rel.gff

### load into ACT: SM_V7_$1_$2-$3.fa, SM_V7_${4}.fa, SM_V7_$1_$2-$3_$4.blast.out
### read after: SM_V7_$1_$2-$3_rel.gff, $4.gff.gz

Just another tip for Artemis, after loading the fasta file, use the menu “Read Entry” instead of “Read Entry into” to load gff. The latter option can cause an error.