The file `Demos/rrm.hmm` gives an example of a HMMER ASCII save
file. An abridged version is shown here, where (...) mark deletions
made for clarity and space:

    HMMER2.0
    NAME  rrm
    DESC  
    LENG  72
    ALPH  Amino
    RF    no
    CS    no
    COM   ../src/hmmbuild rrm.hmm rrm.slx
    COM   hmmcalibrate rrm.hmm
    NSEQ  70
    DATE  Mon Jan 19 08:11:49 1998
    XT      -8455     -4  -1000  -1000  -8455     -4  -8455     -4
    NULT       -4  -8455
    NULE      595  -1558     85    338   -294    453  -1158  (...)
    EVD   -49.999123   0.271164
    HMM        A      C      D      E      F      G      H      I  (...)
            m->m   m->i   m->d   i->m   i->i   d->m   d->d   b->m   m->e
             -21      *  -6129
         1 -1234   -371  -8214  -7849  -5304  -8003  -7706   2384  (...)
         -  -149   -500    233     43   -381    399    106   -626  (...)
         -   -11 -11284 -12326   -894  -1115   -701  -1378    -21      *
         2 -3634  -3460  -5973  -5340   3521  -2129  -4036   -831  (...)
         -  -149   -500    233     43   -381    399    106   -626  (...)
         -   -11 -11284 -12326   -894  -1115   -701  -1378      *      *
    (...)
        71 -1165  -4790   -240   -275  -5105  -4306   1035  -2009  (...)
         -  -149   -500    233     43   -381    398    106   -626  (...)
         -   -43  -6001 -12336   -150  -3342   -701  -1378      *      *
        72 -1929   1218  -1535  -1647  -3990  -4677  -3410   1725  (...)
         -     *      *      *      *      *      *      *      *  (...)
         -     *      *      *      *      *      *      *      *      0
    //

HMMER2 profile HMM save files have a very different format compared to the previous HMMER1 ASCII formats. The HMMER2 format provides all the necessary parameters to compare a protein sequence to a HMM, including the search mode of the HMM (hmmls, hmmfs, hmmsw, and hmms in the old HMMER1 package), the null (background) model, and the statistics to evaluate the match on the basis of a previously fitted extreme value distribution.

The format consists of one or more HMMs. Each HMM starts with the identifier `HMMER2.0` on a line by itself and ends with `//` on a line by itself. The identifier allows backward compatibility as the HMMER software evolves. The closing `//` allows multiple HMMs to be concatenated into a single file to provide a database of HMMs.
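As a rough illustration of the record structure just described, the following Python sketch splits the text of a save file into one list of lines per HMM. The function name `split_hmm_records` is hypothetical and is not part of HMMER; the sketch assumes each record begins with a line starting with `HMMER2` and ends with `//` on its own line.

```python
def split_hmm_records(text):
    """Split save-file text into one list of lines per HMM record.

    Hypothetical helper, not HMMER code. Lines before the first
    identifier line are ignored; a record without a closing "//"
    is dropped.
    """
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            # Record separator: close the record collected so far.
            if current:
                records.append(current)
            current = []
        elif line.startswith("HMMER2") or current:
            # Either the start of a new record or a line inside one.
            current.append(line)
    return records


example = "HMMER2.0\nNAME  a\nHMM\n//\nHMMER2.0\nNAME  b\nHMM\n//\n"
records = split_hmm_records(example)
print(len(records))
```

Keeping each record as a list of lines leaves the header and main sections untouched, so they can be parsed separately afterwards.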

The format for an HMM is divided into two regions. The first region
contains text information and miscellaneous parameters in a (roughly)
tag-value scheme, akin to EMBL formats. This section is ended by a
line beginning with the keyword `HMM`. The second region is of a
more fixed format and contains the main model parameters. It is ended
by the // that ends the entire definition for a single profile-HMM.

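Both regions contain probabilities that are used to parameterize the HMM.
These are stored as integers which are related to the probability via
a log-odds calculation. The log-odds score calculation is defined in
`mathsupport.c` and is:

$$\mathrm{score} = \left\lfloor 0.5 + \mathrm{INTSCALE} \cdot \log_2 \frac{\mathrm{prob}}{\mathrm{null\_prob}} \right\rfloor$$

so conversely, to get a probability from the scores in an HMM save file:

$$\mathrm{prob} = \mathrm{null\_prob} \cdot 2^{\mathrm{score}/\mathrm{INTSCALE}}$$

`INTSCALE` is defined in `config.h` as 1000.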

Notice that you must know a null model probability to convert scores back to HMM probabilities.

The special case of prob = 0 is translated to ``*'', so a score of * is read as a probability of 0. Null model probabilities are not allowed to be 0.
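The conversion in both directions can be sketched in a few lines of Python. This is an illustration rather than the HMMER implementation: the names `prob2score` and `score2prob` are chosen here for readability, and the sketch assumes a scaled base-2 log-odds ratio rounded to the nearest integer, with `INTSCALE` set to the 1000 quoted above.

```python
import math

INTSCALE = 1000  # value the text gives for INTSCALE in config.h


def prob2score(prob, null_prob):
    """Probability -> integer log-odds score; probability 0 becomes "*"."""
    if prob == 0.0:
        return "*"
    # Assumed form: scaled log2 odds ratio, rounded to the nearest integer.
    return math.floor(0.5 + INTSCALE * math.log2(prob / null_prob))


def score2prob(score, null_prob):
    """Save-file score (int, numeric string, or "*") -> probability."""
    if score == "*":
        return 0.0
    return null_prob * 2.0 ** (int(score) / INTSCALE)


print(score2prob(-21, 1.0))   # a transition score; null probability is 1.0
print(prob2score(0.5, 1.0))
```

Because scores are rounded to integers, a probability recovered from a score is only accurate to roughly three decimal digits of the log-odds value. This is one reason the parser renormalizes distributions after reading them.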

This log-odds format has been chosen because it has a better dynamic
range than storing probabilities as ASCII text, and because the
numbers are interpretable to a certain extent: positive values mean a
better than expected probability, and negative values a worse than
expected probability. However, because of the conversion from
probabilities, it should be noted that *you should not edit the
numbers in a HMMER save file directly*. The HMM is a probabilistic
model and expects state transition and symbol emission probability
distributions to sum to one. If you want to edit the HMM, you must
understand the underlying Plan7 probabilistic model, and ensure the
correct summations yourself.

A more detailed description of the format now follows.

In the header section, each line after the initial identifier has a unique tag of five characters or less. For shorter tags, the remainder of the five characters is padded with spaces. Therefore the first six characters of these lines are reserved for the tag and a space. The remainder of the line starts at the seventh character. The parser does require this.
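The fixed-width tag layout makes header lines easy to split by column. The Python sketch below shows one way to do it; `parse_header_line` and `parse_header` are hypothetical names, and the sample lines assume the padding described above (tag in the first five characters, one space, value from the seventh character).

```python
def parse_header_line(line):
    """Split a header line into (tag, value) using the fixed-width layout."""
    tag = line[:5].strip()     # first five characters hold the tag
    value = line[6:].strip()   # value starts at the seventh character
    return tag, value


def parse_header(lines):
    """Collect header lines into a dict of tag -> list of values.

    A list is used because some tags (such as COM) may occur more
    than once in a save file.
    """
    header = {}
    for line in lines:
        tag, value = parse_header_line(line)
        header.setdefault(tag, []).append(value)
    return header


header = parse_header([
    "NAME  rrm",
    "LENG  72",
    "COM   ../src/hmmbuild rrm.hmm rrm.slx",
    "COM   hmmcalibrate rrm.hmm",
])
print(header["COM"])
```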

- [`HMMER2.0`] File format version; a unique identifier for this save file format. Used for backwards compatibility. *Not* necessarily the version number of the HMMER software that generated it; rather, the version number of the last HMMER that changed the format (e.g., HMMER 2.8 might still be writing save files that are headed `HMMER2.0`). **Mandatory.**
- [`NAME <s>`] Model name; `<s>` is a single word name for the HMM. No spaces or tabs may occur in the name. By default, `hmmbuild` sets this using the name of the alignment file, after removing any file type suffix. For example, an HMM built from the alignment file `rrm.slx` would be named `rrm` by default. **Mandatory.**
- [`DESC <s>`] Description line; `<s>` is a one-line description of the HMM. Currently, there is no way to set this! A future extension to SELEX file format will allow us to pick up the description line from Pfam. **Mandatory.**
- [`LENG <d>`] Model length; `<d>`, a positive nonzero integer, is the number of match states in the model. **Mandatory.**
- [`ALPH <s>`] Symbol alphabet; `<s>` must be either `Amino` or `Nucleic` (case insensitive). This determines the symbol alphabet and the size of the symbol emission probability distributions. If `Amino`, the alphabet size is set to 20 and the symbol alphabet to ``ACDEFGHIKLMNPQRSTVWY'' (alphabetic order). If `Nucleic`, the alphabet size is set to 4 and the symbol alphabet to ``ACGT''. **Mandatory.**
- [`RF <s>`] Reference annotation flag; `<s>` must be either `no` or `yes` (case insensitive). If set to `yes`, a character of reference annotation is read for each match state/consensus column in the main section of the file (see below); else this data field will be ignored. Reference annotation lines are currently somewhat inconsistently used. The only major use in HMMER is to specify which columns of an alignment get turned into match states when using the `hmmbuild -hand` manual model construction option. Reference annotation can only be picked up from SELEX format alignments. See description of SELEX format for more details on reference annotation lines. **Optional**; assumed to be no if not present.
- [`CS <s>`] Consensus structure annotation flag; `<s>` must be either `no` or `yes` (case insensitive). If set to `yes`, a character of consensus structure annotation is read for each match state/consensus column in the main section of the file (see below); else this data field will be ignored. Consensus structure annotation lines are currently somewhat inconsistently used. Consensus structure annotation can only be picked up from SELEX format alignments. See description of SELEX format for more details on consensus structure annotation lines. **Optional**; assumed to be no if not present.
- [`COM <s>`] Command line log; `<s>` is a one-line command. There may be more than one `COM` line per save file. These lines record the command line for every HMMER command that modifies the save file. This helps us automatically log Pfam construction strategies, for example. **Optional.**
- [`NSEQ <d>`] Sequence number; `<d>` is a nonzero positive integer, the number of sequences that the HMM was trained on. This field is only used for logging purposes. **Optional.**
- [`DATE <s>`] Creation date; `<s>` is a date string. This field is only used for logging purposes. **Optional.**
- [`XT <d>*8`] Eight ``special'' transitions for controlling the algorithm-specific parts of the Plan7 model. The null probability used to convert these back to model probabilities is 1.0. The order of the eight fields is N→B, N→N, E→C, E→J, C→T, C→C, J→B, J→J. (Another way to view the order is as four transition probability distributions for N, E, C, J; each distribution has two probabilities, the first one for ``moving'' and the second one for ``looping''.) For an explanation of these special transitions (and definition of the state names), read the Plan7 architecture documentation. **Mandatory.**
- [`NULT <d> <d>`] The transition probability distribution for the null model (single G state). The null probability used to convert these back to model probabilities is 1.0. The order is G→G, G→F. **Mandatory.**
- [`NULE <d>*K`] The symbol emission probability distribution for the null model (G state); consists of *K* (e.g. 4 or 20) integers. The null probability used to convert these back to model probabilities is 1/*K*. (Yes, it's a little weird to have a ``null probability'' for the null model symbol emission probabilities; this is strictly an aesthetic decision, so one can look at the null model and easily tell which amino acids are more common than chance expectation in the background distribution.) **Mandatory.**
- [`EVD <f> <f>`] The extreme value distribution parameters μ and λ, respectively; both floating point values. λ is positive and nonzero. These values are set when the model is calibrated with `hmmcalibrate`. They are used to determine E-values of bit scores. If this line is not present, E-values are calculated using a conservative analytic upper bound. **Optional.**
- [`HMM`] HMM flag line; flags the end of the header section. Otherwise not parsed. Strictly for human readability, the symbol alphabet is also shown on this line, aligned to the `NULE` fields and the fields of the match and insert symbol emission distributions in the main model. The immediately next line is also an unparsed human annotation line: column headers for the state transition probability fields in the main model section that follows. Both lines are **mandatory.**
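To make the null probabilities for the `XT` and `NULE` lines concrete, here is a small Python sketch that decodes both. The helper names are hypothetical, and `score2prob` assumes the scaled base-2 log-odds encoding with `INTSCALE` equal to 1000; the `XT` values are the ones from the `rrm` excerpt.

```python
INTSCALE = 1000  # value the text gives for INTSCALE in config.h


def score2prob(score, null_prob):
    """Assumed decoding: null_prob * 2^(score/INTSCALE); "*" means 0."""
    if score == "*":
        return 0.0
    return null_prob * 2.0 ** (int(score) / INTSCALE)


def decode_xt(fields):
    """Eight XT scores -> move/loop probabilities for N, E, C, J."""
    probs = [score2prob(f, 1.0) for f in fields]  # null probability 1.0
    states = ["N", "E", "C", "J"]
    return {s: {"move": probs[2 * i], "loop": probs[2 * i + 1]}
            for i, s in enumerate(states)}


def decode_nule(fields):
    """K NULE scores -> null emission probabilities (null probability 1/K)."""
    k = len(fields)
    return [score2prob(f, 1.0 / k) for f in fields]


xt = decode_xt("-8455 -4 -1000 -1000 -8455 -4 -8455 -4".split())
print(xt["N"], xt["E"])
```

In this example the E state splits its probability evenly between E→C and E→J (both scores are -1000), while N, C and J each loop with probability close to 1.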

All the remaining fields are **mandatory**.

The first line in the main model section is atypical; it contains
three fields, for transitions from the B state into the first node of
the model. *The only purpose of this line is to set the B→D
transition probability*. The first field is the score
for 1 - t(B→D).
The second field is always ``*'' (there is no B→I
transition). The third field is the score for t(B→D).
The null probability used for converting these
scores back to probabilities is 1.0. In principle, only the third
number is needed to obtain t(B→D).
In practice, HMMER
reads both the first and the third number, converts them to
probabilities, and renormalizes the distribution to obtain t(B→D).
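The handling of this first line can be sketched as follows. This is an illustration, not the HMMER parser: `begin_to_delete` is a hypothetical name, the sketch assumes the first field encodes the complement of the B→D probability, and `score2prob` assumes the scaled base-2 log-odds encoding with `INTSCALE` equal to 1000.

```python
INTSCALE = 1000  # value the text gives for INTSCALE in config.h


def score2prob(score, null_prob):
    """Assumed decoding: null_prob * 2^(score/INTSCALE); "*" means 0."""
    if score == "*":
        return 0.0
    return null_prob * 2.0 ** (int(score) / INTSCALE)


def begin_to_delete(first_line):
    """Recover the B->D probability from the first main-section line."""
    fields = first_line.split()
    not_bd = score2prob(fields[0], 1.0)  # assumed: score for 1 - t(B->D)
    bd = score2prob(fields[2], 1.0)      # score for t(B->D)
    # Renormalize so the two probabilities sum to exactly 1.0.
    return bd / (not_bd + bd)


print(begin_to_delete("-21 * -6129"))
```

For the line from the `rrm` excerpt, the two decoded probabilities already sum to nearly 1.0, so the renormalization only removes a small rounding error.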

The remainder of the model has three lines per node, for *M* nodes
(where *M* is the number of match states, as given by the `LENG`
line). These three lines are:

- [**Match emission line**] The first field is the node number (1..M). The HMMER parser verifies this number as a consistency check (it expects the nodes to come in order). The remaining fields are *K* numbers for match emission scores, one per symbol, in alphabetic order. The null probability used to convert them to probabilities is the relevant null model emission probability calculated from the `NULE` line.
- [**Insert emission line**] The first field is a character of reference annotation (RF), or ``-'' if there is no reference annotation. The remaining fields are *K* numbers for insert emission scores, one per symbol, in alphabetic order. The null probability used to convert them to probabilities is the relevant null model emission probability calculated from the `NULE` line.
- [**State transition line**] The first field is a character of consensus structure annotation (CS), or ``-'' if there is no consensus structure annotation. The remaining 9 fields are state transition scores. The null probability used to convert them back from log odds scores to probabilities is 1.0. The order of these scores is given by the annotation line at the top of the main section: it is M→M, M→I, M→D; I→M, I→I; D→M, D→D; B→M; M→E.
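The three-line node layout can be read with a short Python sketch like the one below. `parse_node` is a hypothetical helper rather than HMMER code, and scores are kept as strings so that ``*'' survives unchanged. The sample node is node 1 of the `rrm` excerpt, cut down to the eight emission columns shown there, so *K* is 8 here only for illustration.

```python
# Transition order taken from the annotation line of the main section.
TRANSITIONS = ["m->m", "m->i", "m->d", "i->m", "i->i",
               "d->m", "d->d", "b->m", "m->e"]


def parse_node(lines, k):
    """Parse the three lines of one node into a dict of raw score strings."""
    match, insert, trans = (line.split() for line in lines)
    if len(match) != k + 1 or len(insert) != k + 1:
        raise ValueError("emission line does not have K scores")
    if len(trans) != len(TRANSITIONS) + 1:
        raise ValueError("state transition line does not have 9 scores")
    return {
        "node": int(match[0]),        # node number, 1..M
        "match": match[1:],           # K match emission scores
        "rf": insert[0],              # reference annotation or "-"
        "insert": insert[1:],         # K insert emission scores
        "cs": trans[0],               # consensus structure annotation or "-"
        "transitions": dict(zip(TRANSITIONS, trans[1:])),
    }


node = parse_node([
    "1 -1234 -371 -8214 -7849 -5304 -8003 -7706 2384",
    "- -149 -500 233 43 -381 399 106 -626",
    "- -11 -11284 -12326 -894 -1115 -701 -1378 -21 *",
], k=8)
print(node["node"], node["transitions"]["b->m"])
```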

The insert emission and state transition lines for the final node *M* are special. Node *M* has no insert state, so all the insert
emissions are given as ``*''. (In fact, this line is skipped by the
parser, except for its RF annotation.) There is also no next node, so
only the B→M and M→E transitions are
valid; the first seven transitions are always ``*''. (Incidentally,
the M→E transition score for the last node is always 0,
because this probability has to be 1.0.)

Finally, the last line of the format is the ``//'' record separator.

After the parser reads the file and converts the scores back to
probabilities, it renormalizes the probability distributions to sum to
1.0 to eliminate minor rounding/conversion/numerical imprecision
errors. If you're trying to emulate HMMER save files, it might be
useful to know what HMMER considers to be a probability
distribution. See
`Plan7Renormalize()` in `plan7.c` for the relevant
function.

- [**null emissions**] The *K* symbol emissions given on the `NULE` line.
- [**null transitions**] The two null model transitions given on the `NULT` line.
- [**N,E,C,J specials**] Each of the four special states N, E, C, J has two state transition probabilities (move and loop). All four distributions are specified on the `XT` line.
- [**B transitions**] *M* B→M entry probabilities are given by the 9th field in the state transition line of each of the *M* nodes. The B→D transition (from the atypical first line of the main model section) is also part of this state transition distribution.
- [**match transitions**] One distribution of 4 numbers per node; M→M, M→I, M→D, and M→E (fields 2, 3, 4, and 10 in the state transition line of each node). Note the asymmetry between B→M and M→E; entries are a probability distribution of their own, while exits are not.
- [**insert transitions**] One distribution of 2 numbers per node; I→M, I→I (fields 5 and 6 of the state transition line of each node).
- [**delete transitions**] One distribution of 2 numbers per node; D→M, D→D (fields 7 and 8 of the state transition line of each node).
- [**match emissions**] One distribution of *K* numbers per node; the *K* match symbol emissions given on the first line of each node in the main section.
- [**insert emissions**] One distribution of *K* numbers per node; the *K* insert symbol emissions given on the second line of each node in the main section.
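The per-node transition groupings in this list can be illustrated with a short Python sketch. It is not a port of `Plan7Renormalize()`: the function names are hypothetical, `score2prob` assumes the scaled base-2 log-odds encoding with `INTSCALE` equal to 1000, and the B→M score is deliberately left out because it belongs to the B distribution that spans all nodes.

```python
INTSCALE = 1000  # value the text gives for INTSCALE in config.h


def score2prob(score, null_prob):
    """Assumed decoding: null_prob * 2^(score/INTSCALE); "*" means 0."""
    if score == "*":
        return 0.0
    return null_prob * 2.0 ** (int(score) / INTSCALE)


def normalize(probs):
    """Scale a list of probabilities to sum to 1.0 (all-zero lists are kept)."""
    total = sum(probs)
    return [p / total for p in probs] if total > 0 else list(probs)


def node_transition_distributions(scores):
    """Group the 9 transition scores of one node into its distributions.

    `scores` excludes the leading CS character, so index 0 is M->M and
    index 8 is M->E. Index 7 (B->M) is not used here.
    """
    p = [score2prob(s, 1.0) for s in scores]  # null probability 1.0
    return {
        "match": normalize([p[0], p[1], p[2], p[8]]),  # M->M, M->I, M->D, M->E
        "insert": normalize(p[3:5]),                   # I->M, I->I
        "delete": normalize(p[5:7]),                   # D->M, D->D
    }


dists = node_transition_distributions(
    "-11 -11284 -12326 -894 -1115 -701 -1378 -21 *".split())
print(dists["delete"])
```

The zero-total guard in `normalize` covers the final node, whose insert and delete transitions are all ``*'' and therefore decode to 0.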

Though I make an effort to keep this documentation up to date, it may
lag behind the code. For definitive answers, please check the parsing
code in `hmmio.c`. The relevant function to see what's being
written is `WriteAscHMM()`. The relevant function to see how it's
being parsed is `read_asc20hmm()`.