Monday, October 18, 2010

Pairwise Sequence Alignment




Hi Guys,

Interesting fact: Richard Robert got 1993's noble prize for discovering introns in genes. 

So, lets talk about sequence alignment. In principle if two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggest that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.

An example of nucleotide sequence alignment:




Few Basic Definitions:

There are two different definitions of comparing sequences:

1. Edit Distance:  The edit distance is interpreted as the number evolutionary operations to transform a string A into B using mutations, insertion and deletion of one symbol as the only evolutionary operations.

2. Alignment Distance: optimal value of edit distance for a given cost function.

When calculating relevant between pair of sequence, we might face two different ways of calculating cost. First one as described above is distance and we always try to minimize the distance between two sequence while other annotation is similarity where we try to maximize the similarity score. 

There also can be two different type of alignments, global, searching for overall similarity or local, searching for blocks of matching substrings. A general global alignment technique is the Needleman-Wunsch algorithm, which is based on dynamic programming. The Smith-Waterman algorithm is a general local alignment method also based on dynamic programming. With sufficiently similar sequences, there is no difference between local and global alignments.

Both algorithms are based on the calculation of a pairwise score matrix which aligns and corresponds a score for each pair position. It can be calculated using dynamic programming. 

To find the alignment with the highest score, a two-dimensional array (or matrix) F is allocated. The entry in row i and column j is denoted here by Fij. There is one column for each character in sequence A, and one row for each character in sequence B. Thus, if we are aligning sequences of sizes n and m, the amount of memory used is in O(nm). (Hirschberg's algorithm can compute an optimal alignment in Θ(min{n,m}) space, roughly doubling the running time.) As the algorithm progresses, the Fij will be assigned to be the optimal score for the alignment of the first i = 0,...,n characters in A and the first j = 0,...,m characters in B. The principle of optimality is then applied as follows. 
 
Basis:
  F0j = d * j
  Fi0 = d * i
  Recursion, based on the principle of optimality:
  F_{ij} = \max(F_{i-1,j-1} + S(A_{i}, B_{j}), \; F_{i,j-1} + d, \; F_{i-1,j} + d)
A F matrix looks like: 

There are many more possibilities to improve alignment speed and space consumption. Such as Hirschberg's algorithm, wraparound method, using affine gap cost etc.

I implemented both local and global alignment in python. Both can be executed by command line.

Global Alignment

######### prepared as project work for course Algorithms in Bioinformatics #########

#!/usr/bin/env python
#include<Python.h>
# Author - Vikas Gupta
# Email - vgupta@birc.au.dk
# Created on 20082010 and being continueously updated
######### Script was made using Python 3.0.1 on windows Platform ##########
## if it is not working on your system then let me know
import sys, time, os, re, math
from sys import *
import getopt, pprint
start = time.clock()
############### Defult parameters ######################

def default():
    print ("""\
Hey There, try : global_linalign.py Seq1|file1 Seq2|file2 [option]
-m file reads matrix from a file
        Default Matrix:
            acgt
            10
            2  10
            5   2  10
            2   5  2  10
-g val    Gaps are required to be a multiple of val. (Default = -5)
""" )
          
#print (len(sys.argv))
#User interface in Commond line
seq1=''
seq2=''
flag = 0
if (len(sys.argv)<3):
    default()
if (len(sys.argv)>=3):
        s=sys.argv[1]
        #print (s)
        if os.path.isfile(s):
            try:
                s=sys.argv[1]
                file1 = open(s,'r')
                for line in file1.readlines():
                    if(line[0]=='>'):
                        continue
                    else:
                        line = line.strip()
                        seq1 = seq1 + line
                        flag = 1
               # print ("ok")
                s=sys.argv[2]
                file2 = open(s,'r')
                for line in file2.readlines():
                    if(line[0]=='>'):
                        continue
                    else:
                        line = line.strip()
                        seq2 = seq2 + line
                        flag = 1
               # print ("ok")
          
            except IOError:
                print("can't open file")
        if (flag==0):
            try:
                seq1 = sys.argv[1]
                seq2 = sys.argv[2]
            except ImportError:
                default()
                sys.exit(2)
seq1 = seq1.replace(' ','')
seq2 = seq2.replace(' ','')
if (len(sys.argv)>3):
    if (sys.argv[3] == '-m'):
        matrix_file=sys.argv[4]
        s = {}
        letters = {}
        mat = open(matrix_file,'r')
        first_line = True
        count = 0
    for line in mat.readlines():
        if line[0] == "#":
            continue
        if first_line:
            line_count = int(line.strip())
            #print (line_count)
            first_line = False
        else:
           line = line.strip()
           tokens = line.split()
           letters[count] = tokens[0]
           #print (letters[count])
           count +=1
           if (count > line_count):
                print ("substitution score matrix has more variable than notified in first line")
    mat.close()
    mat = open(matrix_file,'r')
    first_line = True
    #print (line_count,count)
    for line in mat.readlines():
        if line[0] == "#":
            continue
        if first_line:
            first_line = False
        else :
            #print (line)
            line = line.strip()
            tokens = line.split()
            row_letter = tokens[0]
            #print (tokens[0])
            for i in range(0,line_count):
                c = i + 1
                s[(row_letter)+(letters[i])]= int(tokens[c])
                #print(s['AA'])
if(len(sys.argv)>5):
    if(sys.argv[5] == '-g'):
        val = float(sys.argv[6])
        #print (val)
        gap = -1*val
        #print (gap)
  
#aking input sequences
seq1='AATAAT'
seq2='AAGG'
seq1 = seq1.replace(' ','')
seq2 = seq2.replace(' ','')
print ("input sequence 1---->",seq1)
print ("input sequence 2---->",seq2)

#stderr.write('wait I am calulaing\n')

rows=len(seq1)+1
cols=len(seq2)+1
########making upper case############################
for i in range(0,len(seq1)):
    if (ord(seq1[i])>=97 and ord(seq1[i])<=122):
        seq1=seq1.replace(chr(ord(seq1[i])),chr(ord(seq1[i])-32))
for i in range(0,len(seq2)):
    if (ord(seq2[i])>=97 and ord(seq2[i])<=122):
        seq2=seq2.replace(chr(ord(seq2[i])),chr(ord(seq2[i])-32))


try:
    #use fast numerical method array if we can
    from numpy import *
    a = zeroes((rows,cols),float)
except ImportError:
    #use a list if we have to
    a=[]
    for i in range(rows):
        a+=[[0.]*cols]


#using substitution matrix and gap penalties
if (len(sys.argv)<4):
        match= 10
        gap=-5
        s={'AA':match,'AG':5,'AC':2,'AT':2,\
           'GA':5,'GG':match,'GC':2,'GT':2,\
           'CA':2,'CG':2,'CC':match,'CT':5,\
           'TA':2,'TG':2,'TC':5,'TT':match}
        #print (s)

#making optimal matrix now

for i in range(rows):
    a[i][0]= -5*i
for j in range(cols):
    a[0][j]= -5*j

for i in range(1,rows):
    for j in range(1,cols):
        choice1=a[i-1][j-1]+s[(seq1[i-1])+(seq2[j-1])]
        choice2=a[i-1][j]+gap
        choice3=a[i][j-1]+gap
        a[i][j] = max(choice1,choice2,choice3)
for i in range(0,rows):
    s2= []
    for j in range(0,cols):
        s2.append(str(a[i][j]))
        s2.append(" ")
    print("".join(s2))

               #we reconstruct the alignment into aseq1 and aseq2

aseq1 = ''
aseq2 = ''

i=len(seq1)
j=len(seq2)
#print(i,j)
os = 0
count_align = 0
while (i>0 and j>0):
  score=a[i][j]
  score_dig=a[i-1][j-1]
  score_up=a[i][j-1]
  score_left=a[i-1][j]
  if ( (score == score_up + gap) & (score == score_left + gap) ):
      count_align += 2
  if ( (score == score_dig + s[seq1[i-1]+ seq2[j-1]]) & (score == score_left + gap) ):
      count_align += 2
  if ( (score == score_dig + s[seq1[i-1]+ seq2[j-1]]) & (score == score_up + gap) ):
      count_align += 2
  if score == score_dig + s[seq1[i-1]+ seq2[j-1]]:
           aseq1 = seq1[i-1] + aseq1
           aseq2 = seq2[j-1] + aseq2
           os += s[seq1[i-1]+ seq2[j-1]]
           i -= 1
           j -= 1
  elif score == score_left + gap:
                                   aseq1 = seq1[i-1]+ aseq1
                                   aseq2 = '_' + aseq2
                                   os += gap
                                   i -=1
  elif score == score_up + gap:
                  aseq1 = '_' + aseq1
                  aseq2 = seq2[j-1]+aseq2
                  j -=1
                  os += gap
  
  else:
                                   print ('what the hell is wrong :\n')
                                   i=0
                                   j=0
                                   aseq1='Error';
while i>0:
                                   aseq1 = seq1[i-1] + aseq1
                                   aseq2 = '_' + aseq2
                                   os += gap
                                   i -=1
while j>0:
                                   aseq1 = '_' + aseq1
                                   aseq2 = seq2[j-1] + aseq2
                                   os += gap
                                   j -=1
print(">seq1")
print (aseq1)
print(">seq2")
print (aseq2)
print("optimal alignment score--->",os)
elapsed = (time.clock() - start)
print ("calculation time---->",elapsed)
print ("number of alignment--->",count_align)
##################################################################################


Local Alignment


########################################################
###############Local Alignment##########################
########################################################


#!/usr/bin/env python
#include<Python.h>
######### prepared as project work for course Algorithms in Bioinformatics #########
# Author - Vikas Gupta
# Email - vgupta@birc.au.dk
# Created on 20082010 and being continueously updated
######### Script was made using Python 3.0.1 on windows Platform ##########
## if it is not working on your system then let me know
import sys, time, os, re, math
from sys import *
import getopt, pprint

start = time.clock()



############### Defult parameters ######################
def default():
    print ("""\
Hey There, try : global_linalign.py Seq1|file1 Seq2|file2 [option]
-m file reads matrix from a file
        Default Matrix:
            acgt
            10
            2  10
            5   2  10
            2   5  2  10
-g val    Gaps are required to be a multiple of val. (Default = -5)
""" )
          

#print (len(sys.argv))
#User interface in Commond line
seq1=''
seq2=''
flag = 0
if (len(sys.argv)<3):
    default()
if (len(sys.argv)>=3):
        s=sys.argv[1]
        #print (s)
        if os.path.isfile(s):
            try:
                s=sys.argv[1]
                file1 = open(s,'r')
                for line in file1.readlines():
                    if(line[0]=='>'):
                        continue
                    else:
                        line = line.strip()
                        seq1 = seq1 + line
                        flag = 1
                #print ("ok")
                s=sys.argv[2]
                file2 = open(s,'r')
                for line in file2.readlines():
                    if(line[0]=='>'):
                        continue
                    else:
                        line = line.strip()
                        seq2 = seq2 + line
                        flag = 1
                #print ("ok")
          
            except IOError:
                print("can't open file")
        if (flag==0):
            try:
                seq1 = sys.argv[1]
                seq2 = sys.argv[2]
            except ImportError:
                default()
                sys.exit(2)
seq1 = seq1.replace(' ','')
seq2 = seq2.replace(' ','')
if (len(sys.argv)>3):
    if (sys.argv[3] == '-m'):
        matrix_file=sys.argv[4]
        s = {}
        letters = {}
        mat = open(matrix_file,'r')
        first_line = True
        count = 0
    for line in mat.readlines():
        if line[0] == "#":
            continue
        if first_line:
            line_count = int(line.strip())
            #print (line_count)
            first_line = False
        else:
           line = line.strip()
           tokens = line.split()
           letters[count] = tokens[0]
           print (letters[count])
           count +=1
           if (count > line_count):
                print ("substitution score matrix has more variable than notified in first line")
    mat.close()
    mat = open(matrix_file,'r')
    first_line = True
    print (line_count,count)
    for line in mat.readlines():
        if line[0] == "#":
            continue
        if first_line:
            first_line = False
        else :
            #print (line)
            line = line.strip()
            tokens = line.split()
            row_letter = tokens[0]
            #print (tokens[0])
            for i in range(0,line_count):
                c = i + 1
                s[(row_letter)+(letters[i])]= int(tokens[c])
                print(s['AA'])
if(len(sys.argv)>5):
    if(sys.argv[5] == '-g'):
        val = float(sys.argv[6])
        print (val)
        gap = -1*val
        print (gap)
  
#seq1='AATAAT'
#seq2='AAGG'
seq1 = seq1.replace(' ','')
seq2 = seq2.replace(' ','')
print ("input seq1-->",seq1)
print ("input seq2-->",seq2)

stderr.write('wait I am calulaing\n')

rows=len(seq1)+1
cols=len(seq2)+1

########making upper case############################
for i in range(0,len(seq1)):
    if (ord(seq1[i])>=97 and ord(seq1[i])<=122):
        seq1=seq1.replace(chr(ord(seq1[i])),chr(ord(seq1[i])-32))
for i in range(0,len(seq2)):
    if (ord(seq2[i])>=97 and ord(seq2[i])<=122):
        seq2=seq2.replace(chr(ord(seq2[i])),chr(ord(seq2[i])-32))

try:
    #use fast numerical method array if we can
    from numpy import *
    a = zeroes((rows,cols),float)
except ImportError:
    #use a list if we have to
    a=[]
    for i in range(rows):
        a+=[[0.]*cols]


#using substitution matrix and gap penalties
if (len(sys.argv)<=3):
    match= 10
    mismatch= 5
    gap=-5
    s={'AA':match,'AG':5,'AC':2,'AT':2,\
       'GA':5,'GG':match,'GC':2,'GT':2,\
       'CA':2,'CG':2,'CC':match,'CT':5,\
       'TA':2,'TG':2,'TC':5,'TT':match}
    #print (s)

#making optimal matrix now

for i in range(rows):
    a[i][0]= 0
for j in range(cols):
    a[0][j]= 0
maximum = 0
max_i = 0
max_j = 0
for i in range(1,rows):
    for j in range(1,cols):
        choice1=a[i-1][j-1]+s[(seq1[i-1]+seq2[j-1])]
        choice2=a[i-1][j]+gap
        choice3=a[i][j-1]+gap
        choice4=0
        a[i][j] = max(choice1,choice2,choice3,choice4)
        if (a[i][j]>=maximum):
            maximum = a[i][j];
            max_i = i;
            max_j = j;
        #print(a[i][j],i,j,'\t')

#we reconstruct the alignment into aseq1 and aseq2

aseq1 = ''
aseq2 = ''

i= max_i;
j= max_j;
#print(i,j)
os = 0
while (i>0 and j>0):
  score=a[i][j]
  score_dig=a[i-1][j-1]
  score_up=a[i][j-1]
  score_left=a[i-1][j]
  if (a[i][j]==0):
      i = 0
      j = 0
  elif score == score_dig + s[seq1[i-1]+ seq2[j-1]]:
           aseq1 = seq1[i-1] + aseq1
           aseq2 = seq2[j-1] + aseq2
           os += s[seq1[i-1]+ seq2[j-1]]
           i -= 1
           j -= 1
  elif score == score_left + gap:
                                   aseq1 = seq1[i-1]+ aseq1
                                   aseq2 = '_' + aseq2
                                   os += gap
                                   i -=1
  elif score == score_up + gap:
                  aseq1 = '_' + aseq1
                  aseq2 = seq2[j-1]+aseq2
                  j -=1
                  os += gap
  else:
                                   print ('what the hell is wrong :\n')
                                   i=0
                                   j=0
                                   aseq1='Error';
while i>0:
                                   aseq1 = seq1[i-1] + aseq1
                                   aseq2 = '_' + aseq2
                                   os += gap
                                   i -=1
while j>0:
                                   aseq1 = '_' + aseq1
                                   aseq2 = seq2[j-1] + aseq2
                                   os += gap
                                   j -=1
print(">seq1")
print (aseq1)
print(">seq2")
print (aseq2)
print("optimal alignment score--->",os)
elapsed = (time.clock() - start)
print ("calculation time---->",elapsed)

##########################################################################


No comments:

Post a Comment