HMMER save files

The file Demos/rrm.hmm gives an example of a HMMER ASCII save file. An abridged version is shown here, where (...) mark deletions made for clarity and space:

NAME  rrm
LENG  72
ALPH  Amino
RF    no
CS    no
COM   ../src/hmmbuild rrm.hmm rrm.slx
COM   hmmcalibrate rrm.hmm
NSEQ  70
DATE  Mon Jan 19 08:11:49 1998
XT      -8455     -4  -1000  -1000  -8455     -4  -8455     -4 
NULT      -4  -8455
NULE     595  -1558     85    338   -294    453  -1158    (...)
EVD   -49.999123   0.271164
HMM        A      C      D      E      F      G      H      I (...)
         m->m   m->i   m->d   i->m   i->i   d->m   d->d   b->m   m->e
          -21      *  -6129
     1  -1234   -371  -8214  -7849  -5304  -8003  -7706   2384  (...)
     -   -149   -500    233     43   -381    399    106   -626  (...)
     -    -11 -11284 -12326   -894  -1115   -701  -1378    -21      * 
     2  -3634  -3460  -5973  -5340   3521  -2129  -4036   -831  (...)
     -   -149   -500    233     43   -381    399    106   -626  (...)
     -    -11 -11284 -12326   -894  -1115   -701  -1378      *      * 
    71  -1165  -4790   -240   -275  -5105  -4306   1035  -2009  (...) 
     -   -149   -500    233     43   -381    398    106   -626  (...)
     -    -43  -6001 -12336   -150  -3342   -701  -1378      *      * 
    72  -1929   1218  -1535  -1647  -3990  -4677  -3410   1725  (...)
     -      *      *      *      *      *      *      *      *  (...) 
     -      *      *      *      *      *      *      *      *      0 

HMMER2 profile HMM save files have a very different format compared to the previous HMMER1 ASCII formats. The HMMER2 format provides all the necessary parameters to compare a protein sequence to a HMM, including the search mode of the HMM (hmmls, hmmfs, hmmsw, and hmms in the old HMMER1 package), the null (background) model, and the statistics to evaluate the match on the basis of a previously fitted extreme value distribution.

The format consists of one or more HMMs. Each HMM starts with the identifier HMMER2 on a line by itself and ends with // on a line by itself. The identifier allows backward compatibility as the HMMER software evolves. The closing // allows multiple HMMs to be concatenated into a single file to provide a database of HMMs.
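As a rough illustration of how the identifier and the closing // delimit records, here is a minimal Python sketch that splits the text of a concatenated file into one list of lines per HMM. The function name split_hmm_records is hypothetical and is not part of HMMER. Matching the identifier with a prefix test (so that a version suffix would also be accepted) is an assumption on my part.

```python
def split_hmm_records(text):
    """Split the text of a save file into one list of lines per HMM.

    Hypothetical helper, not part of HMMER. A record opens at a line that
    starts with the identifier "HMMER2" (prefix match is an assumption)
    and closes at a line containing only "//".
    """
    records = []
    current = None  # lines of the record being collected, or None
    for line in text.splitlines():
        stripped = line.strip()
        if current is None:
            # Outside a record: ignore everything until an identifier line.
            if stripped.startswith("HMMER2"):
                current = [line]
            continue
        if stripped == "//":
            # The separator closes the record and is not stored.
            records.append(current)
            current = None
        else:
            current.append(line)
    return records
```

Each returned record keeps its identifier line as the first element, so a caller can still inspect it before parsing the header.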

The format for an HMM is divided into two regions. The first region contains text information and miscellaneous parameters in a (roughly) tag-value scheme, akin to EMBL formats. This section is ended by a line beginning with the keyword HMM. The second region is of a more fixed format and contains the main model parameters. It is ended by the // that ends the entire definition for a single profile-HMM.

Both regions contain probabilities that are used to parameterize the HMM. These are stored as integers which are related to the probability via a log-odds calculation. The log-odds score calculation is defined in mathsupport.c and is:

\begin{displaymath}\mbox{score} = (\mbox{\texttt{int}}) \mbox{\texttt{floor}}(0.5 + (\mbox{\texttt{INTSCALE}} * \log_2(\mbox{prob}/\mbox{null-prob})))\end{displaymath}

so conversely, to get a probability from the scores in an HMM save file:

\begin{displaymath}\mbox{prob} = \mbox{null-prob} * 2^{\mbox{score}/\mbox{\texttt{INTSCALE}}}\end{displaymath}

INTSCALE is defined in config.h as 1000.

Notice that you must know a null model probability to convert scores back to HMM probabilities.

The special case of prob = 0 is translated to ``*'', so a score of * is read as a probability of 0. Null model probabilities are not allowed to be 0.
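The two conversions, together with the ``*'' special case, can be sketched in Python as follows. This is an illustration only, not the code in mathsupport.c: the function names are hypothetical, and the 0.5 offset inside floor (round to nearest) is taken as an assumption about the rounding step.

```python
import math

INTSCALE = 1000  # value the text gives for config.h


def prob_to_score(prob, null_prob):
    """Return the integer log-odds score, or "*" when prob is 0.

    Hypothetical helper. The 0.5 offset rounds to the nearest integer
    and is an assumption about the rounding step.
    """
    if null_prob <= 0.0:
        raise ValueError("null model probabilities are not allowed to be 0")
    if prob == 0.0:
        return "*"
    return int(math.floor(0.5 + INTSCALE * math.log2(prob / null_prob)))


def score_to_prob(score, null_prob):
    """Invert prob_to_score; a score of "*" is read as probability 0."""
    if score == "*":
        return 0.0
    return null_prob * 2.0 ** (int(score) / INTSCALE)
```

Because the score is rounded to an integer, a round trip through both functions recovers the original probability only approximately, which is one reason the parser renormalizes after reading.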

This log-odds format was chosen because it has a better dynamic range than storing probabilities as ASCII text, and because the numbers are interpretable to a certain extent: positive values mean a better than expected probability, and negative values a worse than expected probability. However, because the scores are derived from probabilities, you should not edit the numbers in a HMMER save file directly. The HMM is a probabilistic model and expects state transition and symbol emission probability distributions to sum to one. If you want to edit the HMM, you must understand the underlying Plan7 probabilistic model and ensure the correct summations yourself.

A more detailed description of the format now follows.

Header section

In the header section, each line after the initial identifier has a unique tag of five characters or fewer. For shorter tags, the remainder of the five characters is padded with spaces. Therefore the first six characters of these lines are reserved for the tag and a space. The remainder of the line starts at the seventh character. The parser requires this layout.
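The fixed column layout can be illustrated with a short Python sketch. Both function names are hypothetical, and collecting values into lists (so that a repeated tag such as COM keeps every occurrence) is a design choice of this sketch, not something HMMER prescribes.

```python
def parse_header_line(line):
    """Split one header line into (tag, value).

    Hypothetical helper: characters 1-5 hold the space-padded tag,
    character 6 is a space, and the value starts at character 7.
    """
    tag = line[:5].rstrip()
    value = line[6:].strip()
    return tag, value


def read_header(lines):
    """Collect header values until the line beginning with HMM.

    Expects the lines that follow the initial identifier. Values are
    stored in lists because a tag such as COM may occur more than once.
    """
    header = {}
    for line in lines:
        tag, value = parse_header_line(line)
        if tag == "HMM":
            break  # start of the main model section
        header.setdefault(tag, []).append(value)
    return header
```

For the abridged rrm example, read_header would return two entries under COM and stop at the HMM line, leaving the main model section to a separate parser.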

Main model section

All the remaining fields are mandatory.

The first line in the main model section is atypical; it contains three fields, for transitions from the B state into the first node of the model. The only purpose of this line is to set the B $\rightarrow$ D transition probability. The first field is the score for $ 1 - t(B \rightarrow D)$. The second field is always ``*'' (there is no B $\rightarrow$ I transition). The third field is the score for $t(B\rightarrow D)$. The null probability used for converting these scores back to probabilities is 1.0. In principle, only the third number is needed to obtain $t(B\rightarrow D)$. In practice, HMMER reads both the first and the third number, converts them to probabilities, and renormalizes the distribution to obtain $t(B\rightarrow D)$.
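To make the renormalization step concrete, here is a minimal Python sketch that recovers $t(B\rightarrow D)$ from the three-field line. It is not the HMMER parser: the function names are hypothetical, INTSCALE is taken as 1000, and the null probability is 1.0 as stated above.

```python
INTSCALE = 1000  # value the text gives for config.h


def score_to_prob(token, null_prob=1.0):
    """Convert a score token to a probability; "*" means probability 0."""
    if token == "*":
        return 0.0
    return null_prob * 2.0 ** (int(token) / INTSCALE)


def begin_to_delete_prob(line):
    """Return t(B->D) from the first line of the main model section.

    Hypothetical helper: reads the first and third fields, converts both
    with a null probability of 1.0, and renormalizes the pair.
    """
    fields = line.split()
    if len(fields) != 3:
        raise ValueError("expected exactly three fields")
    not_bd = score_to_prob(fields[0])  # score for 1 - t(B->D)
    bd = score_to_prob(fields[2])      # score for t(B->D)
    total = not_bd + bd
    if total <= 0.0:
        raise ValueError("both transitions have probability 0")
    return bd / total
```

Applied to the line "-21 * -6129" from the rrm example, this gives a B $\rightarrow$ D probability of roughly 0.014.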

The remainder of the model has three lines per node, for M nodes (where M is the number of match states, as given by the LENG line). These three lines are:

The insert emission and state transition lines for the final node M are special. Node M has no insert state, so all the insert emissions are given as ``*''. (In fact, this line is skipped by the parser, except for its RF annotation.) There is also no next node, so only the B $\rightarrow$ M and M $\rightarrow$ E transitions are valid; the first seven transitions are always ``*''. (Incidentally, the M $\rightarrow$ E transition score for the last node is always 0, because this probability has to be 1.0.)
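The following Python sketch shows one way to read a state transition line and to check the final-node pattern described above. The helper names are hypothetical. The transition order is taken from the column labels in the rrm example (m->m through m->e), and treating the leading field as an annotation column to be skipped is an assumption of this sketch.

```python
# Column order as labelled in the example file's second HMM header line.
TRANSITIONS = ("m->m", "m->i", "m->d", "i->m", "i->i",
               "d->m", "d->d", "b->m", "m->e")


def parse_transition_line(line):
    """Map each transition name to its integer score, or None for "*".

    Hypothetical helper. The first field (shown as "-" in the example)
    is treated as an annotation column and skipped.
    """
    fields = line.split()
    if len(fields) != 1 + len(TRANSITIONS):
        raise ValueError("expected one leading field and nine scores")
    return {name: (None if tok == "*" else int(tok))
            for name, tok in zip(TRANSITIONS, fields[1:])}


def looks_like_final_node(trans):
    """True if the first seven transitions are "*" and m->e scores 0."""
    first_seven_absent = all(trans[name] is None for name in TRANSITIONS[:7])
    return first_seven_absent and trans["m->e"] == 0
```

A check like looks_like_final_node is only a sanity test for hand-written files; it does not replace counting nodes against the LENG line.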

Finally, the last line of the format is the ``//'' record separator.


After the parser reads the file and converts the scores back to probabilities, it renormalizes the probability distributions to sum to 1.0 to eliminate minor rounding/conversion/numerical imprecision errors. If you're trying to emulate HMMER save files, it might be useful to know what HMMER considers to be a probability distribution. See Plan7Renormalize() in plan7.c for the relevant function.
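As a simple illustration of what renormalizing means here, the Python sketch below rescales one distribution so that it sums to 1.0. It is not a transcription of Plan7Renormalize(): the function name is hypothetical, and which groups of parameters HMMER treats as a single distribution is exactly what you should check in plan7.c.

```python
def renormalize(probs):
    """Rescale a list of non-negative values so that they sum to 1.0.

    Hypothetical helper for one distribution at a time. It does not
    decide which parameters belong together; that grouping is defined
    by the Plan7 model.
    """
    total = sum(probs)
    if total <= 0.0:
        raise ValueError("cannot renormalize an all-zero distribution")
    return [p / total for p in probs]
```

For example, values that sum to 0.99 after score conversion are each divided by 0.99, which preserves their ratios while restoring a proper distribution.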

Note to developers

Though I make an effort to keep this documentation up to date, it may lag behind the code. For definitive answers, please check the parsing code in hmmio.c. The relevant function to see what's being written is WriteAscHMM(). The relevant function to see how it's being parsed is read_asc20hmm().