Get the UTR region for non-annotated transcripts

In the annotation we often have mRNA and CDS and often no UTR annotated. Therefore for genes

on plus (+) strand without 3’UTR: mRNA end_coord = last CDS end_coord
on minus (-) strand without 3’URT: mRNA start_coord = last CDS start_coord

We wanted to extract a pre-defined 3’UTR region (1kb downstream 3’end). Here is a workflow:

1. check if there is any gene with an 3’UTR
1. if no UTR, we can calculate the distance between transcript A end to the start of transcript/gene B (take only 1 transcript per gene), on each scaffold
1. check the distance, if <= 1kb then take the distance; if >1kb then take 1kb. Modify the coordinates (pay attention to the strandness) and output to gff fomat
1. extract sequences

1. To check if any gene has 3’UTR

The idea is to compare the coordinates between mRNA and the last CDS. Apollo always output CDS in their structural order, rather than coordinates, so the last CDS is always at the bottom. For other programs the CDS might be ordered by coordinates.

After grepping the transcript features, we can use the bash command awk 'NR==1; END{print}' to print first (mRNA) and last (last CDS) line, and compare the start/end coordinates (if uniq then gives 1, otherwise 2)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


#!/bin/bash

for id in `cat allmRNA.ids`; do
  printf $id
  # based on the sorted gff3 from Apollo/WBPS; CDS are ordered by their order instead of coordinates
  grep $id *.gff3 | grep 'mRNA\|CDS'|awk 'NR==1; END{print}' |awk '$7=="+"' | awk '{print $5}'| sort -u | wc -l | tr '\n' ' ' 
  grep $id *.gff3 | grep 'mRNA\|CDS'|awk 'NR==1; END{print}' |awk '$7=="-"' | awk '{print $4}'| sort -u | wc -l
  # if CDS sorted by coordinates: 
  #grep $id *.gff3 | grep 'mRNA\|CDS'|sed 's/mRNA/AmRNA/'|sort -k3,3 -k4,4n |awk '$7=="+"' |awk 'NR==1; END{print}'| awk '{print $5}'| sort -u | wc -l | tr '\n' ' ' 
  #grep $id *.gff3 | grep 'mRNA\|CDS'|sed 's/mRNA/AmRNA/'|sort -k3,3 -k4,4n |awk '$7=="-"' | head -n2| awk '{print $4}'| sort -u | wc -l
done

The output looks like this. We need to check if there is any transcript with 2 in their 2nd or 3rd column.

1
2


SSTP_0001282200.1       1        0
SSTP_0001289600.1       0        1

2. Calculating the distance between 3’end and next gene Start

In case of no UTR annotation for all genes, we first prepare a list of primary transcripts (.1), and order them by scaffolds and start coordinates. So they will appear adjacent.

If gene A is on the + strand, we compare it to the next gene B
- if the next gene B is on the same scaffold, then we take distance as A_end to B_start
- if the next gene B is not on the same scaffold, then gene A is the last gene, we take A_end to scaffold_end.
If gene A is on the - strand, we compare it to the previous gene C
- if the previous gene C is on the same scaffold, then we take distance as C_end to A_start
- if the previous gene C is not on the same scaffold, then gene A is the first gene, we take scaffold start (1) to A_start.

We can wrap it into R script:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33


# import mRNA ordered list and contig lengths
sst.mRNAs<-read.delim("all_mRNAs_sorted.txt", sep="\t", header=F)
colnames(sst.mRNAs)<-c("scaffold", "source", "type", "start", "end", "score", "strand", "phase", "attributes")
contiglen<-read.delim("contig_lengths.txt", header=T, sep="\t", row.names = "contig")

# take only primary transcripts
sst.uniq<-subset(sst.mRNAs, grepl('.1$', sst.mRNAs$attributes))

# output the calculation to a file
sink("Calculated_3UTR.txt")
sink(stdout(), type = "message")

# ***need to calculate and add first and last transcript manually***
for (row in 2:nrow(sst.uniq)) {
  # + strand compared to the next gene
  if (sst.uniq[row, "strand"] == "+") {
    # if the following transcript is on the same scaffold
    if (sst.uniq[row, "scaffold"] == sst.uniq[row+1, "scaffold"]) {
      message(sst.uniq[row, "scaffold"], "\t", sst.uniq[row, "end"]+1, "\t", sst.uniq[row+1, "start"]-1, "\t", "+", "\t", sst.uniq[row, "attributes"])
  } else {
    # if the following transcript is not on the same scaffold
      message(sst.uniq[row, "scaffold"], "\t", sst.uniq[row, "end"]+1, "\t", contiglen[as.character(sst.uniq[row, "scaffold"]), "length"], "\t", "+", "\t", sst.uniq[row, "attributes"])
    }
  }  else { # - strand: compared to the previous gene
    if (sst.uniq[row, "scaffold"] == sst.uniq[row-1, "scaffold"]) {
      message(sst.uniq[row, "scaffold"], "\t", sst.uniq[row-1, "end"]+1, "\t", sst.uniq[row, "start"]-1, "\t", "-", "\t", sst.uniq[row, "attributes"])
    } else {
      # if the previous gene is not on the same scaffold
      message(sst.uniq[row, "scaffold"], "\t", 1, "\t", sst.uniq[row, "start"]-1, "\t", "-", "\t", sst.uniq[row, "attributes"])
    }
  }
}
sink()

Output gives the distances

1
2


SSTP_contig0000001	1	220	-	ID=transcript:SSTP_0000548400.1
SSTP_contig0000001	6123	7884	+	ID=transcript:SSTP_0000548500.1

for + strand: col3 comes from mRNA 3’end, col4 comes from start_coord of the next gene
for - strand: col3 comes from end_coord of the previous gene, col4 comes from the mRNA 3’end

3. Adjusting the distance/length

Now it is easy, if the distance is <=1kb then take the distance, if not then adjust the coordinates

1
2
3
4
5
6


# get those with a distance <999 
awk '{print $3-$2, $0}' Calculated_3UTR.txt  | sed 's/:/ /' | awk '$1<999' | awk '{print $2 "\t" "cal_3UTR" "\t" "gene" "\t" $3 "\t" $4 "\t" "." "\t" $5 "\t" "." "\t" "ID=" $7 "|" $2 "|" $3 "-" $4}' > part1-noChange.txt

# modify those >=999; +/- strand treated differently
awk '{print $3-$2, $0}' Calculated_3UTR.txt  | sed 's/:/ /' | awk '$1>=999' | awk '$5=="+"' | awk '{print $2 "\t" "cal_3UTR" "\t" "gene" "\t" $3 "\t" $3+999 "\t" "." "\t" $5 "\t" "." "\t" "ID=" $7"|" $2 "|" $3 "-" $3+999}' > part2-1.txt
awk '{print $3-$2, $0}' Calculated_3UTR.txt  | sed 's/:/ /' | awk '$1>=999' | awk '$5=="-"' | awk '{print $2 "\t" "cal_3UTR" "\t" "gene" "\t" $4-999 "\t" $4 "\t" "." "\t" $5 "\t" "." "\t" "ID=" $7"|" $2 "|" $4-999 "-" $4}' > part2-2.txt

for example:

1
2
3
4


1761 SSTP_contig0000001	6123	7884	+	ID=transcript SSTP_0000548500.1
2829 SSTP_contig0000001	43274	46103	+	ID=transcript SSTP_0000549600.1
2826 SSTP_contig0000001	20231	23057	-	ID=transcript SSTP_0000549000.1
1022 SSTP_contig0000001	28577	29599	-	ID=transcript SSTP_0000549300.1

now becomes

1
2
3
4


SSTP_contig0000001	cal_3UTR	gene	6123	7122	.	+	.	ID=SSTP_0000548500.1
SSTP_contig0000001	cal_3UTR	gene	43274	44273	.	+	.	ID=SSTP_0000549600.1
SSTP_contig0000001	cal_3UTR	gene	22058	23057	.	-	.	ID=SSTP_0000549000.1
SSTP_contig0000001	cal_3UTR	gene	28600	29599	.	-	.	ID=SSTP_0000549300.1

We can combine the files (part*.txt) and make a final gff. Then extract the sequences

1
2
3
4
5


>SSTP_0000548400.1|SSTP_contig0000001|1-220
ACAACGTCCACAAGCAGAACAAGAGAACAACTAAAAGTGGTAAAAAAACAATGCTAAATA
TATAAGTATATAAAATAACGAACAAAACATATTAAAAATATAGTAAATATTCAAAGTGGC
AAAAAGAGAAATATCTAAGACTCAATAGTAGGGATGTGAAGCTTAAAAGAAAGGTTTTTT
TTTAAAAGGCCAGGTGCCACTTCGAAAGGGTATAATTTTT

Just to check the lengths of extracted 3’UTRs (show random 5 genes):

1
2
3
4
5
6
7


python fa-length.py calculated_3UTR-combined.fa | sort -R | head -n5

SSTP_0000483700.1|SSTP_contig0000074|16441-17440	1000
SSTP_0000826800.1|SSTP_scaffold0000019|135995-136463	469
SSTP_0000335500.1|SSTP_contig0000020|38343-39342	1000
SSTP_0000190800.1|SSTP_scaffold0000001|3366064-3366564	501
SSTP_0000374800.1|SSTP_contig0000002|598914-598972	59