Concepts on Bioinformatics: Pairwise Sequence Alignment

Hi Guys,

Interesting fact: Richard Robert got 1993's noble prize for discovering introns in genes.

So, lets talk about sequence alignment. In principle if two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggest that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.

An example of nucleotide sequence alignment:

Few Basic Definitions:

There are two different definitions of comparing sequences:

1. Edit Distance: The edit distance is interpreted as the number evolutionary operations to transform a string A into B using mutations, insertion and deletion of one symbol as the only evolutionary operations.

2. Alignment Distance: optimal value of edit distance for a given cost function.

When calculating relevant between pair of sequence, we might face two different ways of calculating cost. First one as described above is distance and we always try to minimize the distance between two sequence while other annotation is similarity where we try to maximize the similarity score.

There also can be two different type of alignments, global, searching for overall similarity or local, searching for blocks of matching substrings. A general global alignment technique is the Needleman-Wunsch algorithm, which is based on dynamic programming. The Smith-Waterman algorithm is a general local alignment method also based on dynamic programming. With sufficiently similar sequences, there is no difference between local and global alignments.

Both algorithms are based on the calculation of a pairwise score matrix which aligns and corresponds a score for each pair position. It can be calculated using dynamic programming.

To find the alignment with the highest score, a two-dimensional array (or matrix) F is allocated. The entry in row i and column j is denoted here by

F i j

. There is one column for each character in sequence A, and one row for each character in sequence B. Thus, if we are aligning sequences of sizes n and m, the amount of memory used is in

O (n m)

. (Hirschberg's algorithm can compute an optimal alignment in

Θ(min{n, m})

space, roughly doubling the running time.) As the algorithm progresses, the

F i j

will be assigned to be the optimal score for the alignment of the first

i = 0,..., n

characters in A and the first

j = 0,..., m

characters in B. The principle of optimality is then applied as follows.

Basis:
   $F 0 j = d * j$ 
   $F i 0 = d * i$ 
  Recursion, based on the principle of optimality:

A F matrix looks like:

There are many more possibilities to improve alignment speed and space consumption. Such as Hirschberg's algorithm, wraparound method, using affine gap cost etc.

I implemented both local and global alignment in python. Both can be executed by command line.

Global Alignment

######### prepared as project work for course Algorithms in Bioinformatics #########

#!/usr/bin/env python
#include<Python.h>
# Author - Vikas Gupta
# Email - vgupta@birc.au.dk
# Created on 20082010 and being continueously updated
######### Script was made using Python 3.0.1 on windows Platform ##########
## if it is not working on your system then let me know
import sys, time, os, re, math
from sys import *
import getopt, pprint
start = time.clock()
############### Defult parameters ######################

def default():
   print ("""\
Hey There, try : global_linalign.py Seq1|file1 Seq2|file2 [option]
-m file reads matrix from a file
   Default Matrix:
   acgt
   10
   2 10
   5 2 10
   2 5 2 10
-g val Gaps are required to be a multiple of val. (Default = -5)
""" )

#print (len(sys.argv))
#User interface in Commond line
seq1=''
seq2=''
flag = 0
if (len(sys.argv)<3):
   default()
if (len(sys.argv)>=3):
   s=sys.argv[1]
   #print (s)
   if os.path.isfile(s):
   try:
   s=sys.argv[1]
   file1 = open(s,'r')
   for line in file1.readlines():
   if(line[0]=='>'):
   continue
   else:
   line = line.strip()
   seq1 = seq1 + line
   flag = 1
   # print ("ok")
   s=sys.argv[2]
   file2 = open(s,'r')
   for line in file2.readlines():
   if(line[0]=='>'):
   continue
   else:
   line = line.strip()
   seq2 = seq2 + line
   flag = 1
   # print ("ok")

   except IOError:
   print("can't open file")
   if (flag==0):
   try:
   seq1 = sys.argv[1]
   seq2 = sys.argv[2]
   except ImportError:
   default()
   sys.exit(2)
seq1 = seq1.replace(' ','')
seq2 = seq2.replace(' ','')
if (len(sys.argv)>3):
   if (sys.argv[3] == '-m'):
   matrix_file=sys.argv[4]
   s = {}
   letters = {}
   mat = open(matrix_file,'r')
   first_line = True
   count = 0
   for line in mat.readlines():
   if line[0] == "#":
   continue
   if first_line:
   line_count = int(line.strip())
   #print (line_count)
   first_line = False
   else:
   line = line.strip()
   tokens = line.split()
   letters[count] = tokens[0]
   #print (letters[count])
   count +=1
   if (count > line_count):
   print ("substitution score matrix has more variable than notified in first line")
   mat.close()
   mat = open(matrix_file,'r')
   first_line = True
   #print (line_count,count)
   for line in mat.readlines():
   if line[0] == "#":
   continue
   if first_line:
   first_line = False
   else :
   #print (line)
   line = line.strip()
   tokens = line.split()
   row_letter = tokens[0]
   #print (tokens[0])
   for i in range(0,line_count):
   c = i + 1
   s[(row_letter)+(letters[i])]= int(tokens[c])
   #print(s['AA'])
if(len(sys.argv)>5):
   if(sys.argv[5] == '-g'):
   val = float(sys.argv[6])
   #print (val)
   gap = -1*val
   #print (gap)

#aking input sequences
seq1='AATAAT'
seq2='AAGG'
seq1 = seq1.replace(' ','')
seq2 = seq2.replace(' ','')
print ("input sequence 1---->",seq1)
print ("input sequence 2---->",seq2)

#stderr.write('wait I am calulaing\n')

rows=len(seq1)+1
cols=len(seq2)+1
########making upper case############################
for i in range(0,len(seq1)):
   if (ord(seq1[i])>=97 and ord(seq1[i])<=122):
   seq1=seq1.replace(chr(ord(seq1[i])),chr(ord(seq1[i])-32))
for i in range(0,len(seq2)):
   if (ord(seq2[i])>=97 and ord(seq2[i])<=122):
   seq2=seq2.replace(chr(ord(seq2[i])),chr(ord(seq2[i])-32))

try:
   #use fast numerical method array if we can
   from numpy import *
   a = zeroes((rows,cols),float)
except ImportError:
   #use a list if we have to
   a=[]
   for i in range(rows):
   a+=[[0.]*cols]

#using substitution matrix and gap penalties
if (len(sys.argv)<4):
   match= 10
   gap=-5
   s={'AA':match,'AG':5,'AC':2,'AT':2,\
   'GA':5,'GG':match,'GC':2,'GT':2,\
   'CA':2,'CG':2,'CC':match,'CT':5,\
   'TA':2,'TG':2,'TC':5,'TT':match}
   #print (s)

#making optimal matrix now

for i in range(rows):
   a[i][0]= -5*i
for j in range(cols):
   a[0][j]= -5*j

for i in range(1,rows):
   for j in range(1,cols):
   choice1=a[i-1][j-1]+s[(seq1[i-1])+(seq2[j-1])]
   choice2=a[i-1][j]+gap
   choice3=a[i][j-1]+gap
   a[i][j] = max(choice1,choice2,choice3)
for i in range(0,rows):
   s2= []
   for j in range(0,cols):
   s2.append(str(a[i][j]))
   s2.append(" ")
   print("".join(s2))

   #we reconstruct the alignment into aseq1 and aseq2

aseq1 = ''
aseq2 = ''

i=len(seq1)
j=len(seq2)
#print(i,j)
os = 0
count_align = 0
while (i>0 and j>0):
  score=a[i][j]
  score_dig=a[i-1][j-1]
  score_up=a[i][j-1]
  score_left=a[i-1][j]
  if ( (score == score_up + gap) & (score == score_left + gap) ):
   count_align += 2
  if ( (score == score_dig + s[seq1[i-1]+ seq2[j-1]]) & (score == score_left + gap) ):
   count_align += 2
  if ( (score == score_dig + s[seq1[i-1]+ seq2[j-1]]) & (score == score_up + gap) ):
   count_align += 2
  if score == score_dig + s[seq1[i-1]+ seq2[j-1]]:
   aseq1 = seq1[i-1] + aseq1
   aseq2 = seq2[j-1] + aseq2
   os += s[seq1[i-1]+ seq2[j-1]]
   i -= 1
   j -= 1
  elif score == score_left + gap:
   aseq1 = seq1[i-1]+ aseq1
   aseq2 = '_' + aseq2
   os += gap
   i -=1
  elif score == score_up + gap:
   aseq1 = '_' + aseq1
   aseq2 = seq2[j-1]+aseq2
   j -=1
   os += gap

  else:
   print ('what the hell is wrong :\n')
   i=0
   j=0
   aseq1='Error';
while i>0:
   aseq1 = seq1[i-1] + aseq1
   aseq2 = '_' + aseq2
   os += gap
   i -=1
while j>0:
   aseq1 = '_' + aseq1
   aseq2 = seq2[j-1] + aseq2
   os += gap
   j -=1
print(">seq1")
print (aseq1)
print(">seq2")
print (aseq2)
print("optimal alignment score--->",os)
elapsed = (time.clock() - start)
print ("calculation time---->",elapsed)
print ("number of alignment--->",count_align)

##################################################################################

Local Alignment

########################################################

###############Local Alignment##########################

########################################################

#!/usr/bin/env python
#include<Python.h>
######### prepared as project work for course Algorithms in Bioinformatics #########
# Author - Vikas Gupta
# Email - vgupta@birc.au.dk
# Created on 20082010 and being continueously updated
######### Script was made using Python 3.0.1 on windows Platform ##########
## if it is not working on your system then let me know
import sys, time, os, re, math
from sys import *
import getopt, pprint

start = time.clock()

############### Defult parameters ######################
def default():
   print ("""\
Hey There, try : global_linalign.py Seq1|file1 Seq2|file2 [option]
-m file reads matrix from a file
   Default Matrix:
   acgt
   10
   2 10
   5 2 10
   2 5 2 10
-g val Gaps are required to be a multiple of val. (Default = -5)
""" )


#print (len(sys.argv))
#User interface in Commond line
seq1=''
seq2=''
flag = 0
if (len(sys.argv)<3):
   default()
if (len(sys.argv)>=3):
   s=sys.argv[1]
   #print (s)
   if os.path.isfile(s):
   try:
   s=sys.argv[1]
   file1 = open(s,'r')
   for line in file1.readlines():
   if(line[0]=='>'):
   continue
   else:
   line = line.strip()
   seq1 = seq1 + line
   flag = 1
   #print ("ok")
   s=sys.argv[2]
   file2 = open(s,'r')
   for line in file2.readlines():
   if(line[0]=='>'):
   continue
   else:
   line = line.strip()
   seq2 = seq2 + line
   flag = 1
   #print ("ok")

   except IOError:
   print("can't open file")
   if (flag==0):
   try:
   seq1 = sys.argv[1]
   seq2 = sys.argv[2]
   except ImportError:
   default()
   sys.exit(2)
seq1 = seq1.replace(' ','')
seq2 = seq2.replace(' ','')
if (len(sys.argv)>3):
   if (sys.argv[3] == '-m'):
   matrix_file=sys.argv[4]
   s = {}
   letters = {}
   mat = open(matrix_file,'r')
   first_line = True
   count = 0
   for line in mat.readlines():
   if line[0] == "#":
   continue
   if first_line:
   line_count = int(line.strip())
   #print (line_count)
   first_line = False
   else:
   line = line.strip()
   tokens = line.split()
   letters[count] = tokens[0]
   print (letters[count])
   count +=1
   if (count > line_count):
   print ("substitution score matrix has more variable than notified in first line")
   mat.close()
   mat = open(matrix_file,'r')
   first_line = True
   print (line_count,count)
   for line in mat.readlines():
   if line[0] == "#":
   continue
   if first_line:
   first_line = False
   else :
   #print (line)
   line = line.strip()
   tokens = line.split()
   row_letter = tokens[0]
   #print (tokens[0])
   for i in range(0,line_count):
   c = i + 1
   s[(row_letter)+(letters[i])]= int(tokens[c])
   print(s['AA'])
if(len(sys.argv)>5):
   if(sys.argv[5] == '-g'):
   val = float(sys.argv[6])
   print (val)
   gap = -1*val
   print (gap)

#seq1='AATAAT'
#seq2='AAGG'
seq1 = seq1.replace(' ','')
seq2 = seq2.replace(' ','')
print ("input seq1-->",seq1)
print ("input seq2-->",seq2)

stderr.write('wait I am calulaing\n')

rows=len(seq1)+1
cols=len(seq2)+1

########making upper case############################
for i in range(0,len(seq1)):
   if (ord(seq1[i])>=97 and ord(seq1[i])<=122):
   seq1=seq1.replace(chr(ord(seq1[i])),chr(ord(seq1[i])-32))
for i in range(0,len(seq2)):
   if (ord(seq2[i])>=97 and ord(seq2[i])<=122):
   seq2=seq2.replace(chr(ord(seq2[i])),chr(ord(seq2[i])-32))

try:
   #use fast numerical method array if we can
   from numpy import *
   a = zeroes((rows,cols),float)
except ImportError:
   #use a list if we have to
   a=[]
   for i in range(rows):
   a+=[[0.]*cols]

#using substitution matrix and gap penalties
if (len(sys.argv)<=3):
   match= 10
   mismatch= 5
   gap=-5
   s={'AA':match,'AG':5,'AC':2,'AT':2,\
   'GA':5,'GG':match,'GC':2,'GT':2,\
   'CA':2,'CG':2,'CC':match,'CT':5,\
   'TA':2,'TG':2,'TC':5,'TT':match}
   #print (s)

#making optimal matrix now

for i in range(rows):
   a[i][0]= 0
for j in range(cols):
   a[0][j]= 0
maximum = 0
max_i = 0
max_j = 0
for i in range(1,rows):
   for j in range(1,cols):
   choice1=a[i-1][j-1]+s[(seq1[i-1]+seq2[j-1])]
   choice2=a[i-1][j]+gap
   choice3=a[i][j-1]+gap
   choice4=0
   a[i][j] = max(choice1,choice2,choice3,choice4)
   if (a[i][j]>=maximum):
   maximum = a[i][j];
   max_i = i;
   max_j = j;
   #print(a[i][j],i,j,'\t')

#we reconstruct the alignment into aseq1 and aseq2

aseq1 = ''
aseq2 = ''

i= max_i;
j= max_j;
#print(i,j)
os = 0
while (i>0 and j>0):
  score=a[i][j]
  score_dig=a[i-1][j-1]
  score_up=a[i][j-1]
  score_left=a[i-1][j]
  if (a[i][j]==0):
   i = 0
   j = 0
  elif score == score_dig + s[seq1[i-1]+ seq2[j-1]]:
   aseq1 = seq1[i-1] + aseq1
   aseq2 = seq2[j-1] + aseq2
   os += s[seq1[i-1]+ seq2[j-1]]
   i -= 1
   j -= 1
  elif score == score_left + gap:
   aseq1 = seq1[i-1]+ aseq1
   aseq2 = '_' + aseq2
   os += gap
   i -=1
  elif score == score_up + gap:
   aseq1 = '_' + aseq1
   aseq2 = seq2[j-1]+aseq2
   j -=1
   os += gap
  else:
   print ('what the hell is wrong :\n')
   i=0
   j=0
   aseq1='Error';
while i>0:
   aseq1 = seq1[i-1] + aseq1
   aseq2 = '_' + aseq2
   os += gap
   i -=1
while j>0:
   aseq1 = '_' + aseq1
   aseq2 = seq2[j-1] + aseq2
   os += gap
   j -=1
print(">seq1")
print (aseq1)
print(">seq2")
print (aseq2)
print("optimal alignment score--->",os)
elapsed = (time.clock() - start)
print ("calculation time---->",elapsed)

##########################################################################

Concepts on Bioinformatics

Monday, October 18, 2010

Pairwise Sequence Alignment

No comments:

Post a Comment