HMMER save files

The file Demos/rrm.hmm gives an example of a HMMER ASCII save file. An abridged version is shown here, where (...) mark deletions made for clarity and space:

HMMER2.0
NAME  rrm
DESC  
LENG  72
ALPH  Amino
RF    no
CS    no
COM   ../src/hmmbuild rrm.hmm rrm.slx
COM   hmmcalibrate rrm.hmm
NSEQ  70
DATE  Mon Jan 19 08:11:49 1998
XT      -8455     -4  -1000  -1000  -8455     -4  -8455     -4 
NULT      -4  -8455
NULE     595  -1558     85    338   -294    453  -1158    (...)
EVD   -49.999123   0.271164
HMM        A      C      D      E      F      G      H      I (...)
         m->m   m->i   m->d   i->m   i->i   d->m   d->d   b->m   m->e
          -21      *  -6129
     1  -1234   -371  -8214  -7849  -5304  -8003  -7706   2384  (...)
     -   -149   -500    233     43   -381    399    106   -626  (...)
     -    -11 -11284 -12326   -894  -1115   -701  -1378    -21      * 
     2  -3634  -3460  -5973  -5340   3521  -2129  -4036   -831  (...)
     -   -149   -500    233     43   -381    399    106   -626  (...)
     -    -11 -11284 -12326   -894  -1115   -701  -1378      *      * 
(...)
    71  -1165  -4790   -240   -275  -5105  -4306   1035  -2009  (...) 
     -   -149   -500    233     43   -381    398    106   -626  (...)
     -    -43  -6001 -12336   -150  -3342   -701  -1378      *      * 
    72  -1929   1218  -1535  -1647  -3990  -4677  -3410   1725  (...)
     -      *      *      *      *      *      *      *      *  (...) 
     -      *      *      *      *      *      *      *      *      0 
//

HMMER2 profile HMM save files differ substantially from the previous HMMER1 ASCII formats. The HMMER2 format provides all the parameters needed to compare a protein sequence to an HMM, including the search mode of the HMM (hmmls, hmmfs, hmmsw, and hmms in the old HMMER1 package), the null (background) model, and the statistics used to evaluate the match on the basis of a previously fitted extreme value distribution.

The format consists of one or more HMMs. Each HMM starts with the identifier HMMER2.0 on a line by itself and ends with // on a line by itself. The identifier allows backward compatibility as the HMMER software evolves. The closing // allows multiple HMMs to be concatenated into a single file to provide a database of HMMs.
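
As a minimal sketch of how these delimiters can be used, the following Python fragment splits the text of a concatenated save file into individual models. It is an illustration only, not HMMER's own C code: the function name split_hmm_records is hypothetical, the second model name in the demo is a placeholder, and skipping blank lines between records is an assumption rather than documented parser behavior.

```python
def split_hmm_records(text):
    """Split the text of a save file into a list of records.

    Each record is a list of lines, from the HMMER2.0 identifier line
    up to (but not including) the closing // line.
    """
    records = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if current is None:
            if not stripped:
                continue  # assumption: tolerate blank lines between records
            if stripped != "HMMER2.0":
                raise ValueError("expected HMMER2.0 identifier, got: " + line)
            current = [stripped]
        elif stripped == "//":
            records.append(current)  # record separator closes the model
            current = None
        else:
            current.append(line.rstrip())
    if current is not None:
        raise ValueError("last record is missing its closing //")
    return records


# Two tiny placeholder records, concatenated as in an HMM database file.
demo = "HMMER2.0\nNAME  rrm\n//\nHMMER2.0\nNAME  demo2\n//\n"
for record in split_hmm_records(demo):
    print(record[1])
```

Keeping each record as a list of lines leaves the header and main model sections available for separate parsing.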

The format for an HMM is divided into two regions. The first region contains text information and miscellaneous parameters in a (roughly) tag-value scheme, akin to EMBL formats. This section is ended by a line beginning with the keyword HMM. The second region is of a more fixed format and contains the main model parameters. It is ended by the // that ends the entire definition for a single profile-HMM.

Both regions contain probabilities that are used to parameterize the HMM. These are stored as integers which are related to the probability via a log-odds calculation. The log-odds score calculation is defined in mathsupport.c and is:

$\begin{displaymath}\mbox{score} = (\mbox{\texttt{int}}) \mbox{\texttt{floor}}(0.5 + (\mbox{\texttt{INTSCALE}} * \log_2(\mbox{prob}/\mbox{null-prob}))) \end{displaymath}$

so conversely, to get a probability from the scores in an HMM save file:

$\begin{displaymath}\mbox{prob} = \mbox{null-prob} * 2^{\mbox{score}/\mbox{\texttt{INTSCALE}}} \end{displaymath}$

INTSCALE is defined in config.h as 1000.

Notice that you must know a null model probability to convert scores back to HMM probabilities.

The special case of prob = 0 is translated to ``*'', so a score of * is read as a probability of 0. Null model probabilities are not allowed to be 0.
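
The two conversions and the ``*'' special case can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in mathsupport.c: the function names prob_to_score and score_to_prob are hypothetical, and the null probability of 1.0 used for the XT fields is the value given in the header description later in this section.

```python
import math

INTSCALE = 1000  # the value the text gives for INTSCALE in config.h


def prob_to_score(prob, null_prob):
    """Probability -> save-file field (an int, or "*" for prob == 0)."""
    if prob == 0.0:
        return "*"
    return int(math.floor(0.5 + INTSCALE * math.log2(prob / null_prob)))


def score_to_prob(field, null_prob):
    """Save-file field (text such as "-8455" or "*") -> probability."""
    if field == "*":
        return 0.0
    return null_prob * 2.0 ** (int(field) / INTSCALE)


# First two XT fields of the example file (null probability 1.0).
move = score_to_prob("-8455", 1.0)
loop = score_to_prob("-4", 1.0)
print(move, loop, move + loop)  # the sum is close to, but not exactly, 1.0
```

Because scores are rounded to integers, a round trip through the file loses a little precision, which is why the two probabilities no longer sum to exactly one.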

This log-odds format has been chosen because it has a better dynamic range than storing probabilities as ASCII text, and because the numbers are interpretable to a certain extent: positive values mean a better than expected probability, and negative values a worse than expected probability. However, because the scores are converted from probabilities, you should not edit the numbers in a HMMER save file directly. The HMM is a probabilistic model and expects state transition and symbol emission probability distributions to sum to one. If you want to edit the HMM, you must understand the underlying Plan7 probabilistic model, and ensure the correct summations yourself.

A more detailed description of the format now follows.

Header section

In the header section, each line after the initial identifier has a unique tag of five characters or less. For shorter tags, the remainder of the five characters is padded with spaces. Therefore the first six characters of these lines are reserved for the tag and a space. The remainder of the line starts at the seventh character. The parser does require this.

[HMMER2.0] File format version; a unique identifier for this save file format. Used for backwards compatibility. Not necessarily the version number of the HMMER software that generated it; rather, the version number of the last HMMER that changed the format (i.e., HMMER 2.8 might still be writing save files that are headed HMMER2.0). Mandatory.
[NAME <s>] Model name; <s> is a single word name for the HMM. No spaces or tabs may occur in the name. By default, hmmbuild sets this using the name of the alignment file, after removing any file type suffix. For example, an HMM built from the alignment file rrm.slx would be named rrm by default. Mandatory.
[DESC <s>] Description line; <s> is a one-line description of the HMM. Currently, there is no way to set this! A future extension to SELEX file format will allow us to pick up the description line from Pfam. Mandatory.
[LENG <d>] Model length; <d>, a positive nonzero integer, is the number of match states in the model. Mandatory.
[ALPH <s>] Symbol alphabet; <s> must be either Amino or Nucleic. This determines the symbol alphabet and the size of the symbol emission probability distributions. If Amino, the alphabet size is set to 20 and the symbol alphabet to ``ACDEFGHIKLMNPQRSTVWY'' (alphabetic order). If Nucleic, the alphabet size is set to 4 and the symbol alphabet to ``ACGT''. Case insensitive. Mandatory.
[RF <s>] Reference annotation flag; <s> must be either no or yes (case insensitive). If set to yes, a character of reference annotation is read for each match state/consensus column in the main section of the file (see below); else this data field will be ignored. Reference annotation lines are currently somewhat inconsistently used. The only major use in HMMER is to specify which columns of an alignment get turned into match states when using the hmmbuild -hand manual model construction option. Reference annotation can only be picked up from SELEX format alignments. See description of SELEX format for more details on reference annotation lines. Optional; assumed to be no if not present.
[CS <s>] Consensus structure annotation flag; <s> must be either no or yes (case insensitive). If set to yes, a character of consensus structure annotation is read for each match state/consensus column in the main section of the file (see below); else this data field will be ignored. Consensus structure annotation lines are currently somewhat inconsistently used. Consensus structure annotation can only be picked up from SELEX format alignments. See description of SELEX format for more details on consensus structure annotation lines. Optional; assumed to be no if not present.
[COM <s>] Command line log; <s> is a one-line command. There may be more than one COM line per save file. These lines record the command line for every HMMER command that modifies the save file. This helps us automatically log Pfam construction strategies, for example. Optional.
[NSEQ <d>] Sequence number; <d> is a nonzero positive integer, the number of sequences that the HMM was trained on. This field is only used for logging purposes. Optional.
[DATE <s>] Creation date; <s> is a date string. This field is only used for logging purposes. Optional.
[XT <d>*8] Eight ``special'' transitions that control the algorithm-specific parts of the Plan7 model. The null probability used to convert these back to model probabilities is 1.0. The order of the eight fields is N $\rightarrow$ B, N $\rightarrow$ N, E $\rightarrow$ C, E $\rightarrow$ J, C $\rightarrow$ T, C $\rightarrow$ C, J $\rightarrow$ B, J $\rightarrow$ J. (Another way to view the order is as four transition probability distributions for N,E,C,J; each distribution has two probabilities, the first one for ``moving'' and the second one for ``looping''.) For an explanation of these special transitions (and definition of the state names), read the Plan7 architecture documentation. Mandatory.
[NULT <d> <d>] The transition probability distribution for the null model (single G state). The null probability used to convert these back to model probabilities is 1.0. The order is G $\rightarrow$ G, G $\rightarrow$ F. Mandatory.
[NULE <d>*K] The symbol emission probability distribution for the null model (G state); consists of K (e.g. 4 or 20) integers. The null probability used to convert these back to model probabilities is 1/K. (Yes, it's a little weird to have a ``null probability'' for the null model symbol emission probabilities; this is strictly an aesthetic decision, so one can look at the null model and easily tell which amino acids are more common than chance expectation in the background distribution.) Mandatory.
[EVD <f> <f>] The extreme value distribution parameters $\mu$ and $\lambda$ , respectively; both floating point values. $\lambda$ is positive and nonzero. These values are set when the model is calibrated with hmmcalibrate. They are used to determine E-values of bit scores. If this line is not present, E-values are calculated using a conservative analytic upper bound. Optional.
[HMM ] HMM flag line; flags the end of the header section. Otherwise not parsed. Strictly for human readability, the symbol alphabet is also shown on this line, aligned to the NULE fields and the fields of the match and insert symbol emission distributions in the main model. The immediately following line is also an unparsed human annotation line: column headers for the state transition probability fields in the main model section that follows. Both lines are mandatory.
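
The tag-value layout described above can be read with a few lines of Python. This is a rough sketch, not a transcription of read_asc20hmm(): the function name parse_header and the dictionary layout are hypothetical, the input is a hand-abridged copy of the example header (the HMM line is cut short), and no validation of mandatory tags is attempted.

```python
def parse_header(lines):
    """Parse header lines (after the HMMER2.0 line) into a dict.

    Parsing stops at the HMM flag line. COM may repeat, so its values
    are collected in a list. RF and CS default to "no" when absent.
    """
    header = {"COM": [], "RF": "no", "CS": "no"}
    for line in lines:
        tag = line[:5].rstrip()    # characters 1-5: the tag, space padded
        value = line[6:].strip()   # the value starts at the seventh character
        if tag == "HMM":
            break                  # flag line: end of the header section
        if tag == "COM":
            header["COM"].append(value)
        else:
            header[tag] = value
    return header


# Abridged header lines from the example file.
example = [
    "NAME  rrm",
    "LENG  72",
    "ALPH  Amino",
    "COM   ../src/hmmbuild rrm.hmm rrm.slx",
    "COM   hmmcalibrate rrm.hmm",
    "XT      -8455     -4  -1000  -1000  -8455     -4  -8455     -4 ",
    "EVD   -49.999123   0.271164",
    "HMM        A      C      D      E",
]
header = parse_header(example)
length = int(header["LENG"])
xt_scores = [int(field) for field in header["XT"].split()]
mu, lam = (float(field) for field in header["EVD"].split())
print(header["NAME"], length, xt_scores, mu, lam, header["COM"])
```

Values are kept as strings and converted only where a tag calls for numbers, since the tags mix words, integers, and floating point fields.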

Main model section

All the remaining fields are mandatory.

The first line in the main model section is atypical; it contains three fields, for transitions from the B state into the first node of the model. The only purpose of this line is to set the B $\rightarrow$ D transition probability. The first field is the score for $1 - t(B \rightarrow D)$ . The second field is always ``*'' (there is no B $\rightarrow$ I transition). The third field is the score for $t(B\rightarrow D)$ . The null probability used for converting these scores back to probabilities is 1.0. In principle, only the third number is needed to obtain $t(B\rightarrow D)$ . In practice, HMMER reads both the first and the third number, converts them to probabilities, and renormalizes the distribution to obtain $t(B\rightarrow D)$ .
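
The treatment of this first line can be sketched in Python, using the line from the example file. The sketch follows the description in the paragraph above (convert the first and third fields, then renormalize); the function names are hypothetical and the real parser may differ in detail.

```python
INTSCALE = 1000  # the value the text gives for INTSCALE in config.h


def score_to_prob(field, null_prob=1.0):
    """Convert a score field to a probability; "*" means probability 0."""
    if field == "*":
        return 0.0
    return null_prob * 2.0 ** (int(field) / INTSCALE)


def begin_to_delete(line):
    """Return t(B->D) from the atypical first line of the main section."""
    first, second, third = line.split()
    if second != "*":
        raise ValueError("second field must be *: there is no B->I transition")
    not_bd = score_to_prob(first)   # 1 - t(B->D)
    bd = score_to_prob(third)       # t(B->D)
    return bd / (not_bd + bd)       # renormalize the two-value distribution


t_bd = begin_to_delete("          -21      *  -6129")
print(t_bd)
```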

The remainder of the model has three lines per node, for M nodes (where M is the number of match states, as given by the LENG line). These three lines are:

[Match emission line] The first field is the node number (1..M). The HMMER parser verifies this number as a consistency check (it expects the nodes to come in order). The remaining fields are K numbers for match emission scores, one per symbol, in alphabetic order. The null probability used to convert them to probabilities is the relevant null model emission probability calculated from the NULE line.
[Insert emission line] The first field is a character of reference annotation (RF), or ``-'' if there is no reference annotation. The remaining fields are K numbers for insert emission scores, one per symbol, in alphabetic order. The null probability used to convert them to probabilities is the relevant null model emission probability calculated from the NULE line.
[State transition line] The first field is a character of consensus structure annotation (CS), or ``-'' if there is no consensus structure annotation. The remaining 9 fields are state transition scores. The null probability used to convert them back from log odds scores to probabilities is 1.0. The order of these scores is given by the annotation line at the top of the main section: it is M $\rightarrow$ M, M $\rightarrow$ I, M $\rightarrow$ D; I $\rightarrow$ M, I $\rightarrow$ I; D $\rightarrow$ M, D $\rightarrow$ D; B $\rightarrow$ M; M $\rightarrow$ E.

The insert emission and state transition lines for the final node M are special. Node M has no insert state, so all the insert emissions are given as ``*''. (In fact, this line is skipped by the parser, except for its RF annotation.) There is also no next node, so only the B $\rightarrow$ M and M $\rightarrow$ E transitions are valid; the first seven transitions are always ``*''. (Incidentally, the M $\rightarrow$ E transition score for the last node is always 0, because this probability has to be 1.0.)
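
The three-line node layout can be sketched in Python as follows. The function names and the dictionary keys are hypothetical, and the input is a made-up final node of a small Nucleic model (K = 4), invented only to keep the lines short; it is not taken from any real save file.

```python
def parse_scores(fields):
    """Convert text fields to ints, keeping "*" (probability 0) as None."""
    return [None if field == "*" else int(field) for field in fields]


def parse_node(lines, expected_number, k):
    """Parse the three lines of one node; k is the alphabet size."""
    match, insert, trans = (line.split() for line in lines)
    if int(match[0]) != expected_number:
        raise ValueError("nodes out of order")  # consistency check on node number
    if len(match) != k + 1 or len(insert) != k + 1 or len(trans) != 10:
        raise ValueError("wrong number of fields")
    names = ["MM", "MI", "MD", "IM", "II", "DM", "DD", "BM", "ME"]
    return {
        "match": parse_scores(match[1:]),
        "rf": insert[0],                       # reference annotation or "-"
        "insert": parse_scores(insert[1:]),
        "cs": trans[0],                        # consensus structure annotation or "-"
        "transitions": dict(zip(names, parse_scores(trans[1:]))),
    }


# Hypothetical final node (node 3 of 3) of a made-up Nucleic model.
node_lines = [
    "     3   -500   1200   -800   -300",
    "     -      *      *      *      *",
    "     -      *      *      *      *      *      *      *  -2000      0",
]
node = parse_node(node_lines, 3, 4)
print(node["transitions"])
```

In the final node only the BM and ME entries carry scores; every other transition and all insert emissions come back as None.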

Finally, the last line of the format is the ``//'' record separator.

Renormalization

After the parser reads the file and converts the scores back to probabilities, it renormalizes the probability distributions to sum to 1.0 to eliminate minor rounding/conversion/numerical imprecision errors. If you're trying to emulate HMMER save files, it might be useful to know what HMMER considers to be a probability distribution. See Plan7Renormalize() in plan7.c for the relevant function.

[null emissions] The K symbol emissions given on the NULE line.
[null transitions] The two null model transitions given on the NULT line.
[N,E,C,J specials] Each of the four special states N,E,C,J have two state transition probabilities (move and loop). All four distributions are specified on the XT line.
[B transitions] The M entry probabilities B $\rightarrow$ M (one per node) are given by the 9th field in the state transition line of each of the M nodes. The B $\rightarrow$ D transition (from the atypical first line of the main model section) is also part of this state transition distribution.
[match transitions] One distribution of 4 numbers per node; $M \rightarrow M$ , $M \rightarrow I$ , $M \rightarrow D$ , and $M \rightarrow E$ (fields 2, 3, 4, and 10 in the state transition line of each node). Note the asymmetry between B $\rightarrow$ M and M $\rightarrow$ E; entries are a probability distribution of their own, while exits are not.
[insert transitions] One distribution of 2 numbers per node; $I \rightarrow M$ , $I \rightarrow I$ (fields 5 and 6 of the state transition line of each node).
[delete transitions] One distribution of 2 numbers per node; $D \rightarrow M$ , $D \rightarrow D$ (fields 7 and 8 of the state transition line of each node).
[match emissions] One distribution of K numbers per node; the K match symbol emissions given on the first line of each node in the main section.
[insert emissions] One distribution of K numbers per node; the K insert symbol emissions given on the second line of each node in the main section.
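
The per-node transition groupings listed above can be illustrated with the state transition line of node 1 from the example file. This is a Python sketch of the grouping only, not of Plan7Renormalize() itself; the function and variable names are hypothetical.

```python
INTSCALE = 1000  # the value the text gives for INTSCALE in config.h


def score_to_prob(field, null_prob=1.0):
    """Convert a score field to a probability; "*" means probability 0."""
    if field == "*":
        return 0.0
    return null_prob * 2.0 ** (int(field) / INTSCALE)


def renormalize(probs):
    """Scale a list of probabilities so that it sums to 1.0."""
    total = sum(probs)
    return [p / total for p in probs]


# State transition line of node 1 in the example file.
fields = "-    -11 -11284 -12326   -894  -1115   -701  -1378    -21      *".split()
t = [score_to_prob(f) for f in fields[1:]]  # t[0] is field 2, ..., t[8] is field 10

match_t = renormalize([t[0], t[1], t[2], t[8]])  # M->M, M->I, M->D, M->E
insert_t = renormalize([t[3], t[4]])             # I->M, I->I
delete_t = renormalize([t[5], t[6]])             # D->M, D->D
# t[7] (B->M) is left out here: it belongs to the B distribution, which
# spans the B->M fields of all nodes plus the B->D transition.
print(match_t, insert_t, delete_t)
```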

Note to developers

Though I make an effort to keep this documentation up to date, it may lag behind the code. For definitive answers, please check the parsing code in hmmio.c. The relevant function to see what's being written is WriteAscHMM(). The relevant function to see how it's being parsed is read_asc20hmm().