1
2
3
4
5
6
7
8 """Substitution matrices, log odds matrices, and operations on them.
9
10 General:
11 -------
12
13 This module provides a class and a few routines for generating
14 substitution matrices, similar ot BLOSUM or PAM matrices, but based on
15 user-provided data.
16 The class used for these matrices is SeqMat
17
18 Matrices are implemented as a dictionary. Each index contains a 2-tuple,
19 which are the two residue/nucleotide types replaced. The value differs
20 according to the matrix's purpose: e.g in a log-odds frequency matrix, the
21 value would be log(Pij/(Pi*Pj)) where:
22 Pij: frequency of substitution of letter (residue/nucletide) i by j
23 Pi, Pj: expected frequencies of i and j, respectively.
24
25 Usage:
26 -----
27 The following section is layed out in the order by which most people wish
28 to generate a log-odds matrix. Of course, interim matrices can be
29 generated and investigated. Most people just want a log-odds matrix,
30 that's all.
31
32 Generating an Accepted Replacement Matrix:
33 -----------------------------------------
34 Initially, you should generate an accepted replacement matrix (ARM)
35 from your data. The values in ARM are the _counted_ number of
36 replacements according to your data. The data could be a set of pairs
37 or multiple alignments. So for instance if Alanine was replaced by
38 Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding
39 ARM entries would be:
40 ['A','C']: 10,
41 ['C','A'] 12
42 As order doesn't matter, user can already provide only one entry:
43 ['A','C']: 22
44 A SeqMat instance may be initialized with either a full (first
45 method of counting: 10, 12) or half (the latter method, 22) matrices. A
46 Full protein alphabet matrix would be of the size 20x20 = 400. A Half
47 matrix of that alphabet would be 20x20/2 + 20/2 = 210. That is because
48 same-letter entries don't change. (The matrix diagonal). Given an
49 alphabet size of N:
50 Full matrix size:N*N
51 Half matrix size: N(N+1)/2
52
53 If you provide a full matrix, the constructore will create a half-matrix
54 automatically.
55 If you provide a half-matrix, make sure of a (low, high) sorted order in
56 the keys: there should only be
57 a ('A','C') not a ('C','A').
58
59 Internal functions:
60
61 Generating the observed frequency matrix (OFM):
62 ----------------------------------------------
63 Use: OFM = _build_obs_freq_mat(ARM)
64 The OFM is generated from the ARM, only instead of replacement counts, it
65 contains replacement frequencies.
66
67 Generating an expected frequency matrix (EFM):
68 ---------------------------------------------
69 Use: EFM = _build_exp_freq_mat(OFM,exp_freq_table)
70 exp_freq_table: should be a freqTableC instantiation. See freqTable.py for
71 detailed information. Briefly, the expected frequency table has the
72 frequencies of appearance for each member of the alphabet
73
74 Generating a substitution frequency matrix (SFM):
75 ------------------------------------------------
76 Use: SFM = _build_subs_mat(OFM,EFM)
77 Accepts an OFM, EFM. Provides the division product of the corresponding
78 values.
79
80 Generating a log-odds matrix (LOM):
81 ----------------------------------
82 Use: LOM=_build_log_odds_mat(SFM[,logbase=10,factor=10.0,roundit=1])
83 Accepts an SFM. logbase: base of the logarithm used to generate the
84 log-odds values. factor: factor used to multiply the log-odds values.
85 roundit: default - true. Whether to round the values.
86 Each entry is generated by log(LOM[key])*factor
87 And rounded if required.
88
89 External:
90 ---------
91 In most cases, users will want to generate a log-odds matrix only, without
92 explicitly calling the OFM --> EFM --> SFM stages. The function
93 build_log_odds_matrix does that. User provides an ARM and an expected
94 frequency table. The function returns the log-odds matrix.
95
96 Methods for subtraction, addition and multiplication of matrices:
97 -----------------------------------------------------------------
98 * Generation of an expected frequency table from an observed frequency
99 matrix.
100 * Calculation of linear correlation coefficient between two matrices.
101 * Calculation of relative entropy is now done using the
102 _make_relative_entropy method and is stored in the member
103 self.relative_entropy
104 * Calculation of entropy is now done using the _make_entropy method and
105 is stored in the member self.entropy.
106 * Jensen-Shannon distance between the distributions from which the
107 matrices are derived. This is a distance function based on the
108 distribution's entropies.
109 """
110
111
112 import re
113 import sys
114 import copy
115 import math
116 import warnings
117
118
119 from Bio import Alphabet
120 from Bio.SubsMat import FreqTable
121
122 log = math.log
123
124 NOTYPE = 0
125 ACCREP = 1
126 OBSFREQ = 2
127 SUBS = 3
128 EXPFREQ = 4
129 LO = 5
130 EPSILON = 0.00000000000001
131
132
134 """A Generic sequence matrix class
135 The key is a 2-tuple containing the letter indices of the matrix. Those
136 should be sorted in the tuple (low, high). Because each matrix is dealt
137 with as a half-matrix."""
138
140 ab_dict = {}
141 s = ''
142 for i in self.keys():
143 ab_dict[i[0]] = 1
144 ab_dict[i[1]] = 1
145 letters_list = ab_dict.keys()
146 letters_list.sort()
147 for i in letters_list:
148 s = s + i
149 self.alphabet.letters = s
150
151 - def __init__(self,data=None, alphabet=None,
152 mat_type=None,mat_name='',build_later=0):
209
211 keylist = self.keys()
212 for key in keylist:
213 if key[0] > key[1]:
214 self[(key[1],key[0])] = self[key]
215 del self[key]
216
218 """
219 Convert a full-matrix to a half-matrix
220 """
221
222
223
224
225 N = len(self.alphabet.letters)
226
227 if len(self) == N*(N+1)/2:
228 return
229 for i in self.ab_list:
230 for j in self.ab_list[:self.ab_list.index(i)+1]:
231 if i != j:
232 self[j,i] = self[j,i] + self[i,j]
233 del self[i,j]
234
236 for i in self.ab_list:
237 for j in self.ab_list[:self.ab_list.index(i)+1]:
238 self[j,i] = 0.
239
241 """if this matrix is a log-odds matrix, return its entropy
242 Needs the observed frequency matrix for that"""
243
244 ent = 0.
245 if self.mat_type == LO:
246 for i in self.keys():
247 ent += obs_freq_mat[i]*self[i]/log(2)
248 elif self.mat_type == SUBS:
249 for i in self.keys():
250 if self[i] > EPSILON:
251 ent += obs_freq_mat[i]*log(self[i])/log(2)
252 else:
253 raise TypeError("entropy: substitution or log-odds matrices only")
254 self.relative_entropy = ent
255
257 self.entropy = 0
258 for i in self.keys():
259 if self[i] > EPSILON:
260 self.entropy += self[i]*log(self[i])/log(2)
261 self.entropy = -self.entropy
262
264 warnings.warn("SeqMat.letter_sum is deprecated; please use SeqMat.sum instead", DeprecationWarning)
265 assert letter in self.alphabet.letters
266 sum = 0.
267 for i in self.keys():
268 if letter in i:
269 if i[0] == i[1]:
270 sum += self[i]
271 else:
272 sum += (self[i] / 2.)
273 return sum
274
276 import warnings
277 warnings.warn("SeqMat.all_letters_sum is deprecated; please use SeqMat.sum instead", DeprecationWarning)
278 for letter in self.alphabet.letters:
279 self.sum_letters[letter] = self.letter_sum(letter)
280
282 result = {}
283 for letter in self.alphabet.letters:
284 result[letter] = 0.0
285 for pair, value in self.iteritems():
286 i1, i2 = pair
287 if i1==i2:
288 result[i1] += value
289 else:
290 result[i1] += value / 2
291 result[i2] += value / 2
292 return result
293
294 - def print_full_mat(self,f=None,format="%4d",topformat="%4s",
295 alphabet=None,factor=1,non_sym=None):
296 f = f or sys.stdout
297
298
299 assert non_sym == None or type(non_sym) == type(1.) or \
300 type(non_sym) == type(1)
301 full_mat = copy.copy(self)
302 for i in self:
303 if i[0] != i[1]:
304 full_mat[(i[1],i[0])] = full_mat[i]
305 if not alphabet:
306 alphabet = self.ab_list
307 topline = ''
308 for i in alphabet:
309 topline = topline + topformat % i
310 topline = topline + '\n'
311 f.write(topline)
312 for i in alphabet:
313 outline = i
314 for j in alphabet:
315 if alphabet.index(j) > alphabet.index(i) and non_sym is not None:
316 val = non_sym
317 else:
318 val = full_mat[i,j]
319 val *= factor
320 if val <= -999:
321 cur_str = ' ND'
322 else:
323 cur_str = format % val
324
325 outline = outline+cur_str
326 outline = outline+'\n'
327 f.write(outline)
328
329 - def print_mat(self,f=None,format="%4d",bottomformat="%4s",
330 alphabet=None,factor=1):
331 """Print a nice half-matrix. f=sys.stdout to see on the screen
332 User may pass own alphabet, which should contain all letters in the
333 alphabet of the matrix, but may be in a different order. This
334 order will be the order of the letters on the axes"""
335
336 f = f or sys.stdout
337 if not alphabet:
338 alphabet = self.ab_list
339 bottomline = ''
340 for i in alphabet:
341 bottomline = bottomline + bottomformat % i
342 bottomline = bottomline + '\n'
343 for i in alphabet:
344 outline = i
345 for j in alphabet[:alphabet.index(i)+1]:
346 try:
347 val = self[j,i]
348 except KeyError:
349 val = self[i,j]
350 val *= factor
351 if val == -999:
352 cur_str = ' ND'
353 else:
354 cur_str = format % val
355
356 outline = outline+cur_str
357 outline = outline+'\n'
358 f.write(outline)
359 f.write(bottomline)
360
362 """Print a nice half-matrix."""
363 output = ""
364 alphabet = self.ab_list
365 n = len(alphabet)
366 for i in range(n):
367 c1 = alphabet[i]
368 output += c1
369 for j in range(i+1):
370 c2 = alphabet[j]
371 try:
372 val = self[c2,c1]
373 except KeyError:
374 val = self[c1,c2]
375 if val == -999:
376 output += ' ND'
377 else:
378 output += "%4d" % val
379 output += '\n'
380 output += '%4s' * n % tuple(alphabet) + "\n"
381 return output
382
384 """ returns a number which is the subtraction product of the two matrices"""
385 mat_diff = 0
386 for i in self.keys():
387 mat_diff += (self[i] - other[i])
388 return mat_diff
389
391 """ returns a matrix for which each entry is the multiplication product of the
392 two matrices passed"""
393 new_mat = copy.copy(self)
394 for i in self.keys():
395 new_mat[i] *= other[i]
396 return new_mat
397
399 new_mat = copy.copy(self)
400 for i in self.keys():
401 new_mat[i] += other[i]
402 return new_mat
403
405 """Accepted replacements matrix"""
406
408 """Observed frequency matrix"""
409
411 """Expected frequency matrix"""
412
413
415 """Substitution matrix"""
416
418 """Calculate and return the relative entropy with respect to an
419 observed frequency matrix"""
420 relative_entropy = 0.
421 for key, value in self.iteritems():
422 if value > EPSILON:
423 relative_entropy += obs_freq_mat[key]*log(value)
424 relative_entropy /= log(2)
425 return relative_entropy
426
427
429 """Log odds matrix"""
430
432 """Calculate and return the relative entropy with respect to an
433 observed frequency matrix"""
434 relative_entropy = 0.
435 for key, value in self.iteritems():
436 relative_entropy += obs_freq_mat[key]*value/log(2)
437 return relative_entropy
438
439
441 """
442 build_obs_freq_mat(acc_rep_mat):
443 Build the observed frequency matrix, from an accepted replacements matrix
444 The acc_rep_mat matrix should be generated by the user.
445 """
446
447 total = float(sum(acc_rep_mat.values()))
448 obs_freq_mat = ObservedFrequencyMatrix(alphabet=acc_rep_mat.alphabet,
449 build_later=1)
450 for i in acc_rep_mat:
451 obs_freq_mat[i] = acc_rep_mat[i]/total
452 return obs_freq_mat
453
455 exp_freq_table = {}
456 for i in obs_freq_mat.alphabet.letters:
457 exp_freq_table[i] = 0.
458 for i in obs_freq_mat.keys():
459 if i[0] == i[1]:
460 exp_freq_table[i[0]] += obs_freq_mat[i]
461 else:
462 exp_freq_table[i[0]] += obs_freq_mat[i] / 2.
463 exp_freq_table[i[1]] += obs_freq_mat[i] / 2.
464 return FreqTable.FreqTable(exp_freq_table,FreqTable.FREQ)
465
467 """Build an expected frequency matrix
468 exp_freq_table: should be a FreqTable instance
469 """
470 exp_freq_mat = ExpectedFrequencyMatrix(alphabet=exp_freq_table.alphabet,
471 build_later=1)
472 for i in exp_freq_mat:
473 if i[0] == i[1]:
474 exp_freq_mat[i] = exp_freq_table[i[0]]**2
475 else:
476 exp_freq_mat[i] = 2.0*exp_freq_table[i[0]]*exp_freq_table[i[1]]
477 return exp_freq_mat
478
479
480
482 """ Build the substitution matrix """
483 if obs_freq_mat.ab_list != exp_freq_mat.ab_list:
484 raise ValueError("Alphabet mismatch in passed matrices")
485 subs_mat = SubstitutionMatrix(obs_freq_mat)
486 for i in obs_freq_mat:
487 subs_mat[i] = obs_freq_mat[i]/exp_freq_mat[i]
488 return subs_mat
489
490
491
492
494 """_build_log_odds_mat(subs_mat,logbase=10,factor=10.0,round_digit=1):
495 Build a log-odds matrix
496 logbase=2: base of logarithm used to build (default 2)
497 factor=10.: a factor by which each matrix entry is multiplied
498 round_digit: roundoff place after decimal point
499 keep_nd: if true, keeps the -999 value for non-determined values (for which there
500 are no substitutions in the frequency substitutions matrix). If false, plants the
501 minimum log-odds value of the matrix in entries containing -999
502 """
503 lo_mat = LogOddsMatrix(subs_mat)
504 for key, value in subs_mat.iteritems():
505 if value < EPSILON:
506 lo_mat[key] = -999
507 else:
508 lo_mat[key] = round(factor*log(value)/log(logbase),round_digit)
509 mat_min = min(lo_mat.values())
510 if not keep_nd:
511 for i in lo_mat.keys():
512 if lo_mat[i] <= -999:
513 lo_mat[i] = mat_min
514 return lo_mat
515
516
517
518
519
520
521
522 -def make_log_odds_matrix(acc_rep_mat,exp_freq_table=None,logbase=2,
523 factor=1.,round_digit=9,keep_nd=0):
531
537
538 -def read_text_matrix(data_file,mat_type=NOTYPE):
539 matrix = {}
540 tmp = data_file.read().split("\n")
541 table=[]
542 for i in tmp:
543 table.append(i.split())
544
545 for rec in table[:]:
546 if (rec.count('#') > 0):
547 table.remove(rec)
548
549
550 while (table.count([]) > 0):
551 table.remove([])
552
553 alphabet = table[0]
554 j = 0
555 for rec in table[1:]:
556
557 row = alphabet[j]
558
559 if re.compile('[A-z\*]').match(rec[0]):
560 first_col = 1
561 else:
562 first_col = 0
563 i = 0
564 for field in rec[first_col:]:
565 col = alphabet[i]
566 matrix[(row,col)] = float(field)
567 i += 1
568 j += 1
569
570 for i in matrix.keys():
571 if '*' in i: del(matrix[i])
572 ret_mat = SeqMat(matrix,mat_type=mat_type)
573 return ret_mat
574
575 diagNO = 1
576 diagONLY = 2
577 diagALL = 3
578
580 rel_ent = 0.
581 key_list_1 = mat_1.keys(); key_list_2 = mat_2.keys()
582 key_list_1.sort(); key_list_2.sort()
583 key_list = []
584 sum_ent_1 = 0.; sum_ent_2 = 0.
585 for i in key_list_1:
586 if i in key_list_2:
587 key_list.append(i)
588 if len(key_list_1) != len(key_list_2):
589
590 sys.stderr.write("Warning:first matrix has more entries than the second\n")
591 if key_list_1 != key_list_2:
592 sys.stderr.write("Warning: indices not the same between matrices\n")
593 for key in key_list:
594 if diag == diagNO and key[0] == key[1]:
595 continue
596 if diag == diagONLY and key[0] != key[1]:
597 continue
598 if mat_1[key] > EPSILON and mat_2[key] > EPSILON:
599 sum_ent_1 += mat_1[key]
600 sum_ent_2 += mat_2[key]
601
602 for key in key_list:
603 if diag == diagNO and key[0] == key[1]:
604 continue
605 if diag == diagONLY and key[0] != key[1]:
606 continue
607 if mat_1[key] > EPSILON and mat_2[key] > EPSILON:
608 val_1 = mat_1[key] / sum_ent_1
609 val_2 = mat_2[key] / sum_ent_2
610
611 rel_ent += val_1 * log(val_1/val_2)/log(logbase)
612 return rel_ent
613
614
616 try:
617 import numpy
618 except ImportError:
619 raise ImportError, "Please install Numerical Python (numpy) if you want to use this function"
620 values = []
621 assert mat_1.ab_list == mat_2.ab_list
622 for ab_pair in mat_1:
623 try:
624 values.append((mat_1[ab_pair], mat_2[ab_pair]))
625 except KeyError:
626 raise ValueError, "%s is not a common key" % ab_pair
627 correlation_matrix = numpy.corrcoef(values, rowvar=0)
628 correlation = correlation_matrix[0,1]
629 return correlation
630
631
632
634 assert mat_1.ab_list == mat_2.ab_list
635 assert pi_1 > 0 and pi_2 > 0 and pi_1< 1 and pi_2 <1
636 assert not (pi_1 + pi_2 - 1.0 > EPSILON)
637 sum_mat = SeqMat(build_later=1)
638 sum_mat.ab_list = mat_1.ab_list
639 for i in mat_1.keys():
640 sum_mat[i] = pi_1 * mat_1[i] + pi_2 * mat_2[i]
641 sum_mat.make_entropy()
642 mat_1.make_entropy()
643 mat_2.make_entropy()
644
645 dJS = sum_mat.entropy - pi_1 * mat_1.entropy - pi_2 *mat_2.entropy
646 return dJS
647
648 """
649 This isn't working yet. Boo hoo!
650 def two_mat_print(mat_1, mat_2, f=None,alphabet=None,factor_1=1, factor_2=1,
651 format="%4d",bottomformat="%4s",topformat="%4s",
652 topindent=7*" ", bottomindent=1*" "):
653 f = f or sys.stdout
654 if not alphabet:
655 assert mat_1.ab_list == mat_2.ab_list
656 alphabet = mat_1.ab_list
657 len_alphabet = len(alphabet)
658 print_mat = {}
659 topline = topindent
660 bottomline = bottomindent
661 for i in alphabet:
662 bottomline += bottomformat % i
663 topline += topformat % alphabet[len_alphabet-alphabet.index(i)-1]
664 topline += '\n'
665 bottomline += '\n'
666 f.write(topline)
667 for i in alphabet:
668 for j in alphabet:
669 print_mat[i,j] = -999
670 diag_1 = {}; diag_2 = {}
671 for i in alphabet:
672 for j in alphabet[:alphabet.index(i)+1]:
673 if i == j:
674 diag_1[i] = mat_1[(i,i)]
675 diag_2[i] = mat_2[(alphabet[len_alphabet-alphabet.index(i)-1],
676 alphabet[len_alphabet-alphabet.index(i)-1])]
677 else:
678 if i > j:
679 key = (j,i)
680 else:
681 key = (i,j)
682 mat_2_key = [alphabet[len_alphabet-alphabet.index(key[0])-1],
683 alphabet[len_alphabet-alphabet.index(key[1])-1]]
684 # print mat_2_key
685 mat_2_key.sort(); mat_2_key = tuple(mat_2_key)
686 # print key ,"||", mat_2_key
687 print_mat[key] = mat_2[mat_2_key]
688 print_mat[(key[1],key[0])] = mat_1[key]
689 for i in alphabet:
690 outline = i
691 for j in alphabet:
692 if i == j:
693 if diag_1[i] == -999:
694 val_1 = ' ND'
695 else:
696 val_1 = format % (diag_1[i]*factor_1)
697 if diag_2[i] == -999:
698 val_2 = ' ND'
699 else:
700 val_2 = format % (diag_2[i]*factor_2)
701 cur_str = val_1 + " " + val_2
702 else:
703 if print_mat[(i,j)] == -999:
704 val = ' ND'
705 elif alphabet.index(i) > alphabet.index(j):
706 val = format % (print_mat[(i,j)]*factor_1)
707 else:
708 val = format % (print_mat[(i,j)]*factor_2)
709 cur_str = val
710 outline += cur_str
711 outline += bottomformat % (alphabet[len_alphabet-alphabet.index(i)-1] +
712 '\n')
713 f.write(outline)
714 f.write(bottomline)
715 """
716