Long description of the convert.pl script

1. The models

Let us start with the models. The SAM3 model is:

[Figure: diagram of the SAM3 model]

whereas the HMMER2 "Plan7" model is:

[Figure: diagram of the HMMER2 "Plan7" model]

In both diagrams the circles stand for deletes, diamonds for insertions and squares for matches. The colouring shows how information is stored in the two file formats and will be discussed later. At this stage it is sufficient to note the four key differences:

  1. The HMMER2 model consists of a core part roughly similar to the SAM model, plus a number of additional nodes
  2. There are two extra insert nodes in the SAM model as compared to the HMMER core
  3. There are extra D->I and I->D transitions in the SAM model
  4. There are extra B->Mx and Mx->E transitions in the HMMER model

The first difference results in the limitations of this script: the search method (local/local etc.) is contained in the additional nodes of the HMMER model and stored in the save file, whereas in the SAM package it is specified at search-time. The authors of this script were only interested in the specific case of a local/local search and so did not include all the possibilities in the script; however, with the aid of this documentation other possibilities should not be too difficult to add.

The second and third differences result in a loss of information when converting from SAM to HMMER. (No information is lost when converting the other way.) For a sensible model of 100 or more match nodes, the loss of the 2 insert nodes at the beginning and end of the model should not have much effect, but the same cannot be said -- a priori -- of the loss of the D->I and I->D transitions. After setting these to zero in a number of SAM models and renormalising, however, we have observed no change in performance; the D->I and I->D transitions in the SAM model therefore appear entirely redundant.

The fourth difference does not actually result in a loss of information when converting between SAM and HMMER, in the sense that these transition probabilities appear to be given by a simple formula discussed later. The possible redundancy of these transitions was not investigated by the authors.

2. The file formats

Now to the save files themselves. The structure of a typical SAM ASCII file created by the w0.5 or fw0.7 scripts and hmmconvert is:

%         :
%  [Some comments]
%         :
MODEL [some details]
alphabet protein
FREQAVE
0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 
0.063538 0.022206 0.048871 0.060979 0.038441 
0.059832 0.023936 0.060595 0.055899 0.086760 
0.020237 0.045507 0.049257 0.039385 0.048255 
0.071250 0.066954 0.081761 0.017985 0.038349 
0.063538 0.022206 0.048871 0.060979 0.038441 
0.059832 0.023936 0.060595 0.055899 0.086760 
0.020237 0.045507 0.049257 0.039385 0.048255 
0.071250 0.066954 0.081761 0.017985 0.038349 
Begin
0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 
0.000000 0.852573 0.976545 
0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 
0.077669 0.016881 0.057170 0.065752 0.039223 
0.062365 0.024115 0.060124 0.064073 0.082194 
0.024210 0.047157 0.040425 0.041828 0.049012 
0.068753 0.060345 0.072420 0.011511 0.034775 
   1
0.000000 0.026330 0.006903 
0.000000 0.121097 0.016552 
0.006787 0.002413 0.633785 
0.078470 0.014492 0.040764 0.068528 0.043432 
0.050361 0.044873 0.059960 0.074867 0.109772 
0.028121 0.047173 0.041645 0.052252 0.020926 
0.073702 0.026842 0.069371 0.010783 0.043666 
0.077850 0.016640 0.057325 0.065451 0.039844 
0.062177 0.024314 0.060479 0.064418 0.082109 
0.024547 0.047302 0.040276 0.041960 0.048771 
0.067948 0.059895 0.072042 0.011740 0.034912 
   2
   :
   3
   :
   4
   :
End
0.000000 0.000000 0.000000 
0.024238 0.667485 0.022067 
0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 

ENDMODEL

whilst a typical HMMER2.0 file created via hmmbuild -f looks thus:

HMMER2.0
NAME  igs.pfam
LENG  45
ALPH  Amino
       :
[some non-mandatory tags]
       :
XT      -8455     -4  -1000  -1000  -8455     -4  -8455     -4 
NULT      -4  -8455
NULE     595  -1558     85    338   -294    453  -1158    197    249    902  ...
       :
[some non-mandatory tags]
       :
HMM        A      C      D      E      F      G      H      I      K      L  ...
         m->m   m->i   m->d   i->m   i->i   d->m   d->d   b->m   m->e
          -13      *  -6755
     1  -2902  -5397   -782    -65  -5718   3089    142  -5469  -1006  -3213 ...
     -   -149   -500    233     43   -381    399    106   -626    210   -466 ...
     -    -17 -11924 -12966   -894  -1115   -701  -1378  -1013  -6459
     2   -397   -825    529   1685  -5716    912   1172  -2586   -131  -5411 ...
     -   -149   -500    233     43   -381    399    106   -626    210   -466 ...
     -    -17 -11925 -12967   -894  -1115   -701  -1378  -6473  -6443
     :
    45   1367  -4042  -6533  -5898   -421   -418   -298  -2001  -5499   -408 ...
     -      *      *      *      *      *      *      *      *      *      * ...
     -      *      *      *      *      *      *      *  -6473      0
//

The files have a similar structure -- both begin with headers followed by data for each position in the model and end with a closing tag, ENDMODEL for SAM and // for HMMER.

The numbers in the SAM file are the actual transition/emission probabilities themselves, whereas HMMER chooses to store its integral "scores". The following formulae convert between the scores and the actual probabilities:

score = (int) floor [ 0.5 + 1000.0*log(prob/nullProb)/log(2.0) ]
prob  = nullProb * 2 ^ (score/1000.0)

where nullProb is the so-called null probability. To convert between scores and probabilities, the null probability for the particular data item must be known. (Since the logarithm of 0 is not defined, zero probabilities are stored as score "*".)
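
The two formulae can be written as a pair of small functions. The following is a minimal sketch in Python, for illustration only (the script itself is written in Perl); the function names prob_to_score and score_to_prob are invented here, and zero probabilities are mapped to "*" as described above.

```python
import math

def prob_to_score(prob, null_prob):
    """Return the integer score for prob, or "*" for a zero probability."""
    if prob <= 0.0:
        return "*"
    return int(math.floor(0.5 + 1000.0 * math.log(prob / null_prob) / math.log(2.0)))

def score_to_prob(score, null_prob):
    """Invert prob_to_score; the score "*" maps back to probability 0."""
    if score == "*":
        return 0.0
    return null_prob * 2.0 ** (int(score) / 1000.0)

# A probability of one half, with null probability 1.0, scores -1000.
print(prob_to_score(0.5, 1.0))
print(score_to_prob(-1000, 1.0))
```

Because the score is rounded to an integer, converting a probability to a score and back recovers it only approximately.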

2.1 The headers

The SAM header is rather simple, especially as the FREQAVE part is not mandatory and, according to the SAM authors, is not used at present. The HMMER header, on the other hand, is quite involved. It begins with the HMMER2.0 tag, followed by a series of lines with the following structure:

[tag (5 characters)][space][data]

and ends with the HMM tag. The tags must be five characters long (shorter ones are padded with spaces at the end), so that the data start on the 7th character of each line. The meanings are as follows:

Tag    Null probability   Description

LENG   -                  Number of positions in the model

XT     1.0                Scores for transition probabilities between the additional
                          (non-core) nodes of the model (in yellow in the diagram),
                          in the following order:
                          [N->B] [N->N] [E->C] [E->J] [C->T] [C->C] [J->B] [J->J]

NULT   1.0                Scores for transition probabilities of the null model
                          (not important here)

NULE   1/20               Scores for emission probabilities of the null model in
                          alphabetic order of the 1-letter amino-acid codes
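
As an illustration of the header layout, the sketch below splits a tag line into its tag and data and converts the XT scores from the sample file back into probabilities with a null probability of 1.0. It is written in Python for illustration only (the script itself is Perl); parse_tag_line, score_to_prob and XT_ORDER are names invented here, and the data are split on whitespace rather than on fixed columns.

```python
def parse_tag_line(line):
    """Split a header line into its 5-character tag and the data after it."""
    tag = line[:5].strip()    # tags are padded with spaces to 5 characters
    data = line[6:].strip()   # the data start on the 7th character
    return tag, data

def score_to_prob(score, null_prob):
    """Convert a score to a probability; "*" stands for probability 0."""
    if score == "*":
        return 0.0
    return null_prob * 2.0 ** (int(score) / 1000.0)

# Order of the eight XT entries, as listed in the table above.
XT_ORDER = ["N->B", "N->N", "E->C", "E->J", "C->T", "C->C", "J->B", "J->J"]

line = "XT      -8455     -4  -1000  -1000  -8455     -4  -8455     -4 "
tag, data = parse_tag_line(line)
xt = {name: score_to_prob(s, 1.0) for name, s in zip(XT_ORDER, data.split())}

for name in XT_ORDER:
    print("%s %.5f" % (name, xt[name]))
```

For this sample line the E->C and E->J entries both come out as one half, and N->B and N->N sum to 1 up to the rounding of the scores.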

2.2 The data

The syntax of a SAM data block is:

[position number]
[D->D] [M->D] [I->D]
[D->M] [M->M] [I->M]
[D->I] [M->I] [I->I]
[20  match emission probabilities]
[20 insert emission probabilities]

where the first 9 numbers are the state transition probabilities. The HMMER block is similar:

[position number] [20  match emission scores]
       -          [20 insert emission scores]
       -          [M->M] [M->I] [M->D] [I->M] [I->I] [D->M] [D->D] [B->M] [M->E]

The null probabilities for emissions are the null model probabilities given in NULE. (NB do not forget to convert the null model scores actually given on the NULE line to probabilities using the above formula with nullProb=1/20.) The columns are 7 characters wide, with the exception of the first column, which is 6 characters wide, as in the header.
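
The SAM block syntax can be read with a few lines of code. The following Python sketch (for illustration only; the script itself is Perl, and parse_sam_block is a name invented here) parses the block for position 1 of the sample SAM file into its three rows of transitions and its two sets of 20 emission probabilities.

```python
def parse_sam_block(lines):
    """Parse one SAM data block: a label line followed by 49 probabilities."""
    label = lines[0].strip()
    values = [float(tok) for line in lines[1:] for tok in line.split()]
    if len(values) != 49:
        raise ValueError("expected 9 transitions + 40 emissions, got %d" % len(values))
    # Rows of the transition table: into D, into M, into I.
    transitions = [values[0:3], values[3:6], values[6:9]]
    matches = values[9:29]    # 20 match emission probabilities
    inserts = values[29:49]   # 20 insert emission probabilities
    return label, transitions, matches, inserts

SAMPLE = """   1
0.000000 0.026330 0.006903
0.000000 0.121097 0.016552
0.006787 0.002413 0.633785
0.078470 0.014492 0.040764 0.068528 0.043432
0.050361 0.044873 0.059960 0.074867 0.109772
0.028121 0.047173 0.041645 0.052252 0.020926
0.073702 0.026842 0.069371 0.010783 0.043666
0.077850 0.016640 0.057325 0.065451 0.039844
0.062177 0.024314 0.060479 0.064418 0.082109
0.024547 0.047302 0.040276 0.041960 0.048771
0.067948 0.059895 0.072042 0.011740 0.034912
""".splitlines()

label, transitions, matches, inserts = parse_sam_block(SAMPLE)
print(label, transitions[1][1], sum(matches), sum(inserts))
```

Both emission vectors of the sample block sum to 1, which is a useful sanity check on the parse.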

If there are M positions (data blocks) in a HMMER file, there will be M+2 positions in the SAM file: Begin, 1, ..., M, End. The data stored in the Begin and End blocks are shown in red in the SAM model, with the emission probabilities for the B and E nodes being 0. Data in consecutive blocks for both file formats are then shown in blue for positions 1 and 3 and in green for positions 2 and 4. (There is a simple mnemonic for remembering which transition probability is stored where: SAM stores the transition probabilities to a particular node, HMMER stores the transition scores from a particular node.) In the HMMER file, the non-existent transitions from M4 are stored as "*", whilst the D4->E transition isn't stored anywhere and the authors believe that it is implicitly set to 1.
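
The mnemonic can be turned into an index calculation. The Python sketch below (for illustration only; the script itself is Perl) regroups SAM-style "to" transitions into the HMMER "from" order. It rests on an assumption that the text does not state explicitly: block k of the SAM file holds the transitions into D_k and M_k from position k-1, and the transitions into I_k from position k itself. This is consistent with the sample file, where the three transitions leaving the Begin match node then sum to 1. The names hmmer_transitions and states are invented here.

```python
# Index of each transition within one SAM block, in file order:
# D->D, M->D, I->D, D->M, M->M, I->M, D->I, M->I, I->I.
DD, MD, ID, DM, MM, IM, DI, MI, II = range(9)

def hmmer_transitions(states, k):
    """Return [M->M, M->I, M->D, I->M, I->I, D->M, D->D] leaving position k.

    Assumption: transitions into an insert node are stored in the block of
    the same position, all others in the block of the following position.
    """
    here, nxt = states[k], states[k + 1]
    return [nxt[MM], here[MI], nxt[MD], nxt[IM], here[II], nxt[DM], nxt[DD]]

# Transition rows of the Begin block and of block 1 in the sample SAM file.
states = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.852573, 0.976545],
    [0.0, 0.026330, 0.006903, 0.0, 0.121097, 0.016552, 0.006787, 0.002413, 0.633785],
]

out = hmmer_transitions(states, 0)
print(out)
print(out[0] + out[1] + out[2])   # transitions leaving the match node
```

The D->I and I->D entries have no slot in the HMMER order, which is where the loss of information discussed in section 1 occurs.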

As a final peculiarity, the B->D1 transition in HMMER is stored on a single line immediately preceding the first data block. With a null probability of 1.0, this line has the following structure:

[ score(1-P(B->D1)) ] * [ score(P(B->D1)) ]

Although the line only contains the B->D1 probability, the current HMMER parser demands the above syntax.
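
A line of this shape can be produced as follows. This is a Python sketch for illustration only (the script itself is Perl); begin_line and prob_to_score are names invented here, and the layout (a blank first column of 6 characters followed by three columns of 7) is assumed from the column widths of the data blocks.

```python
import math

def prob_to_score(prob, null_prob=1.0):
    """Return the integer score for prob, or "*" for a zero probability."""
    if prob <= 0.0:
        return "*"
    return int(math.floor(0.5 + 1000.0 * math.log(prob / null_prob) / math.log(2.0)))

def begin_line(p_bd1):
    """Format the line carrying P(B->D1): score(1-P), "*", score(P)."""
    fields = [prob_to_score(1.0 - p_bd1), "*", prob_to_score(p_bd1)]
    # Assumed layout: empty 6-character first column, then 7-character columns.
    return " " * 6 + "".join("%7s" % f for f in fields)

# A probability whose score is -6755, as in the sample HMMER file.
print(begin_line(2.0 ** -6.755))
```

With this input the sketch reproduces the "-13 * -6755" line of the sample HMMER file.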

3. The program

Most of the code should be rather self-explanatory after reading the above explanations. Three notes, though. Firstly, the main code at the end of the script simply decides what to do based on the extension of the filename passed to it. Secondly, the internal format of the script is essentially the SAM ASCII format:

$states[$position] ->[0...8]
$matches[$position]->[0...19]
$inserts[$position]->[0...19]

where @states holds the transition probabilities (from 0 to 8: D->D, M->D, I->D, D->M, M->M, I->M, D->I, M->I and I->I) and @matches, @inserts the emission probabilities (0=A etc.). $M is the number of match nodes in the model excluding the Begin and End ones. $position=0 is the Begin position, $position=$M+1 is the End position.

(The authors now realise that this is in fact stupid: normalising the transition probabilities from, say, a match node requires summing over different positions in @states, which is not particularly elegant.)

Thirdly, the formulae for the B->Mx and Mx->E transitions in HMMER: there seem to be some disagreements between the HMMER documentation and the actual state of affairs, so some experimentation was necessary. Check the code for the actual details.