The file Demos/rrm.hmm gives an example of a HMMER ASCII save
file. An abridged version is shown here, where (...) mark deletions
made for clarity and space:
HMMER2.0
NAME rrm
DESC
LENG 72
ALPH Amino
RF no
CS no
COM ../src/hmmbuild rrm.hmm rrm.slx
COM hmmcalibrate rrm.hmm
NSEQ 70
DATE Mon Jan 19 08:11:49 1998
XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT -4 -8455
NULE 595 -1558 85 338 -294 453 -1158 (...)
EVD -49.999123 0.271164
HMM A C D E F G H I (...)
m->m m->i m->d i->m i->i d->m d->d b->m m->e
-21 * -6129
1 -1234 -371 -8214 -7849 -5304 -8003 -7706 2384 (...)
- -149 -500 233 43 -381 399 106 -626 (...)
- -11 -11284 -12326 -894 -1115 -701 -1378 -21 *
2 -3634 -3460 -5973 -5340 3521 -2129 -4036 -831 (...)
- -149 -500 233 43 -381 399 106 -626 (...)
- -11 -11284 -12326 -894 -1115 -701 -1378 * *
(...)
71 -1165 -4790 -240 -275 -5105 -4306 1035 -2009 (...)
- -149 -500 233 43 -381 398 106 -626 (...)
- -43 -6001 -12336 -150 -3342 -701 -1378 * *
72 -1929 1218 -1535 -1647 -3990 -4677 -3410 1725 (...)
- * * * * * * * * (...)
- * * * * * * * * 0
//
HMMER2 profile HMM save files have a very different format compared to
the previous HMMER1 ASCII formats. The HMMER2 format provides all the
necessary parameters to compare a protein sequence to a HMM, including
the search mode of the HMM (hmmls, hmmfs, hmmsw, and hmms in the old
HMMER1 package), the null (background) model, and the statistics to
evaluate the match on the basis of a previously fitted extreme value
distribution.
The format consists of one or more HMMs. Each HMM starts with the
identifier HMMER2.0 on a line by itself and ends with // on a line by
itself. The identifier allows backward compatibility as the HMMER
software evolves. The closing // allows multiple HMMs to be
concatenated into a single file to provide a database of HMMs.
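As a rough illustration of how the // terminator makes a concatenated file easy to split, here is a minimal Python sketch. The function name split_hmm_records and the toy input are made up for this example; this is not how HMMER's own reader in hmmio.c is written.

```python
def split_hmm_records(text):
    """Split the text of a save file into one list of lines per HMM.

    Assumes each HMM ends with a line containing only "//".
    """
    records, current = [], []
    for line in text.splitlines():
        if not line.strip() and not current:
            continue                      # ignore blank lines between records
        current.append(line)
        if line.strip() == "//":          # record terminator
            records.append(current)
            current = []
    if current:
        raise ValueError("last HMM is not terminated by //")
    return records


# Toy input: two heavily abridged records (not valid, complete HMMs).
toy = "HMMER2.0\nNAME  first\n//\nHMMER2.0\nNAME  second\n//\n"
records = split_hmm_records(toy)
```

Each element of records can then be handed to a per-HMM parser; checking that its first line is the expected format identifier is left to that parser.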
The format for an HMM is divided into two regions. The first region
contains text information and miscellaneous parameters in a (roughly)
tag-value scheme, akin to EMBL formats. This section is ended by a
line beginning with the keyword HMM. The second region is of a
more fixed format and contains the main model parameters. It is ended
by the // that ends the entire definition for a single profile-HMM.
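Both regions contain probabilities that are used to parameterize the HMM.
These are stored as integers which are related to the probability via
a log-odds calculation. The log-odds score calculation is defined in
mathsupport.c and is:
\[ \mathrm{score} = \left\lfloor 0.5 + \mathrm{INTSCALE} \cdot \log_2 \frac{\mathrm{prob}}{\mathrm{null\_prob}} \right\rfloor \]
so conversely, to get a probability from the scores in an HMM save
file:
\[ \mathrm{prob} = \mathrm{null\_prob} \cdot 2^{\mathrm{score} / \mathrm{INTSCALE}} \]
INTSCALE is defined in config.h as 1000.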
Notice that you must know a null model probability to convert scores
back to HMM probabilities.
The special case of prob = 0 is translated to ``*'', so a score of *
is read as a probability of 0. Null model probabilities are not
allowed to be 0.
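The conversion can be sketched in Python in both directions. This is a minimal sketch that assumes the rounding rule score = floor(0.5 + INTSCALE * log2(prob / null_prob)) and represents a zero probability by the string "*", as it appears in the file. The function names are illustrative, not HMMER's.

```python
import math

INTSCALE = 1000  # scaling constant, as stated for config.h


def prob_to_score(prob, null_prob):
    """Probability -> integer log-odds score, or "*" for probability 0."""
    if prob == 0.0:
        return "*"
    return math.floor(0.5 + INTSCALE * math.log2(prob / null_prob))


def score_to_prob(score, null_prob):
    """Score field from a save file -> probability; "*" reads as 0."""
    if score == "*":
        return 0.0
    return null_prob * 2.0 ** (int(score) / INTSCALE)


# A probability of 0.5 against a null probability of 1.0 is one bit
# below expectation, i.e. -1 * INTSCALE.
half = prob_to_score(0.5, 1.0)
```

Because scores are rounded to integers, a round trip through prob_to_score and score_to_prob only recovers the probability approximately, which is one reason the distributions are renormalized after reading.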
This log-odds format has been chosen because it has a better dynamic
range than storing probabilities as ASCII text, and because the
numbers are interpretable to a certain extent: positive values mean a
better than expected probability, and negative values a worse than
expected probability. However, because the scores are derived from
probabilities, you should not edit the
numbers in a HMMER save file directly. The HMM is a probabilistic
model and expects state transition and symbol emission probability
distributions to sum to one. If you want to edit the HMM, you must
understand the underlying Plan7 probabilistic model, and ensure the
correct summations yourself.
A more detailed description of the format now follows.
In the header section, each line after the initial identifier has a
unique tag of five characters or less. For shorter tags, the remainder
of the five characters is padded with spaces. Therefore the first six
characters of these lines are reserved for the tag and a space. The
remainder of the line starts at the seventh character. The parser does
require this.
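A minimal Python sketch of this fixed-column layout follows. It assumes the tag occupies the first five characters, the sixth is a space, and the value starts at the seventh. The sample lines are written with that padding, which is not visible in the abridged listing shown earlier. The function name is hypothetical.

```python
def parse_header_line(line):
    """Split one header line into (tag, value) using fixed columns."""
    tag = line[:5].strip()      # columns 1-5: tag, space padded
    value = line[6:].strip()    # column 7 onward: the rest of the line
    return tag, value


# Sample lines with the tag padded out to five characters plus a space.
samples = [
    "NAME  rrm",
    "RF    no",
    "COM   hmmcalibrate rrm.hmm",
    "DESC  ",
]
parsed = [parse_header_line(s) for s in samples]
```

Note that the value of a COM line keeps its internal spaces, and an empty DESC line yields an empty string rather than an error.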
- [HMMER2.0]
File format version; a unique identifier for this save file format. Used for backwards
compatibility. Not necessarily the version number of the HMMER
software that generated it; rather, the version number of the last
HMMER that changed the format (i.e., HMMER 2.8 might still be writing
save files that are headed HMMER2.0).
Mandatory.
- [NAME <s>] Model name; <s> is a single word name for the HMM.
No spaces or tabs may occur in the name. By default, hmmbuild
sets this using the name of the alignment file, after removing any
file type suffix. For example, an HMM built from the alignment file
rrm.slx would be named rrm by default.
Mandatory.
- [DESC <s>] Description line; <s> is a one-line description
of the HMM. Currently, there is no way to set this! A future extension
to SELEX file format will allow us to pick up the description line
from Pfam.
Mandatory.
- [LENG <d>] Model length; <d>, a positive nonzero integer,
is the number of match states in the model.
Mandatory.
- [ALPH <s>] Symbol alphabet; <s> must be either
Amino or Nucleic. This determines the symbol alphabet and the
size of the symbol emission probability distributions. If
Amino, the alphabet size is set to 20 and the symbol alphabet
to ``ACDEFGHIKLMNPQRSTVWY'' (alphabetic order). If Nucleic, the
alphabet size is set to 4 and the symbol alphabet to ``ACGT''. Case
insensitive. Mandatory.
- [RF <s>] Reference annotation flag; <s> must
be either no or yes (case insensitive). If set to
yes, a character of reference annotation is read for each match
state/consensus column in the main section of the file (see below);
else this data field will be ignored. Reference annotation lines are
currently somewhat inconsistently used. The only major use in HMMER is
to specify which columns of an alignment get turned into match states
when using the hmmbuild --hand manual model construction option. Reference
annotation can only be picked up from SELEX format alignments. See
description of SELEX format for more details on reference annotation
lines. Optional; assumed to be no if not present.
- [CS <s>] Consensus structure annotation flag;
<s> must be either no or yes (case insensitive). If set to yes, a character
of consensus structure annotation is read for each match
state/consensus column in the main section of the file (see below);
else this data field will be ignored. Consensus structure annotation
lines are currently somewhat inconsistently used. Consensus structure
annotation can only be picked up from SELEX format alignments. See
description of SELEX format for more details on consensus structure
annotation lines. Optional; assumed to be no if not present.
- [COM <s>] Command line log; <s> is a one-line
command. There may be more than one COM line per save
file. These lines record the command line for every HMMER command that
modifies the save file. This helps us automatically log Pfam
construction strategies, for example. Optional.
- [NSEQ <d>] Sequence number; <d> is a nonzero
positive integer, the number of sequences that the HMM was trained on.
This field is only used for logging purposes.
Optional.
- [DATE <s>] Creation date; <s> is a date string.
This field is only used for logging purposes.
Optional.
- [XT <d>*8] Eight ``special'' transitions for
controlling the algorithm-specific parts of the Plan7 model.
The null probability used to convert these back to model probabilities
is 1.0. The order of the eight fields is N $\rightarrow$ B,
N $\rightarrow$ N, E $\rightarrow$ C, E $\rightarrow$ J,
C $\rightarrow$ T, C $\rightarrow$ C, J $\rightarrow$ B,
J $\rightarrow$ J. (Another
way to view the order is as four transition probability distributions
for N,E,C,J; each distribution has two probabilities, the first one
for ``moving'' and the second one for ``looping''.) For an explanation
of these special transitions (and definition of the state names), read
the Plan7 architecture documentation.
Mandatory.
- [NULT <d> <d>] The transition probability distribution
for the null model (single G state). The null probability used to
convert these back to model probabilities is 1.0. The order is
G $\rightarrow$ G, G $\rightarrow$ F.
Mandatory.
- [NULE <d>*K] The symbol emission probability
distribution for the null model (G state); consists of K (e.g. 4 or
20) integers. The null probability used to convert these back to model
probabilities is 1/K. (Yes, it's a little weird to have a ``null
probability'' for the null model symbol emission probabilities; this
is strictly an aesthetic decision, so one can look at the null model
and easily tell which amino acids are more common than chance
expectation in the background distribution.)
Mandatory.
- [EVD <f> <f>] The extreme value distribution
parameters $\mu$ and $\lambda$, respectively; both floating point
values. $\lambda$ is positive and nonzero. These values are set when
the model is calibrated with hmmcalibrate. They are used to
determine E-values of bit scores. If this line is not present,
E-values are calculated using a conservative analytic upper bound.
Optional.
- [HMM ] HMM flag line; flags the end of the header
section. Otherwise not parsed. Strictly for human readability, the
symbol alphabet is also shown on this line, aligned to the NULE
fields and the fields of the match and insert symbol emission
distributions in the main model. The line immediately following is also an
unparsed human annotation line: column headers for the state
transition probability fields in the main model section that follows.
Both lines are mandatory.
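To make the numeric header lines concrete, here is a hedged Python sketch that decodes the XT and NULE values from the rrm example and evaluates an EVD tail probability. Two things are assumptions of this sketch rather than statements from the text above. First, the EVD is taken to be the usual Gumbel form, P(S >= x) = 1 - exp(-exp(-lambda (x - mu))), with the first EVD field read as the location and the second as the scale. Second, all function names are made up for illustration.

```python
import math

INTSCALE = 1000  # scaling constant, as stated for config.h

# Order of the eight XT fields: (from state, to state).
XT_ORDER = [("N", "B"), ("N", "N"), ("E", "C"), ("E", "J"),
            ("C", "T"), ("C", "C"), ("J", "B"), ("J", "J")]


def score_to_prob(raw, null_prob):
    """Convert one score field to a probability; "*" means probability 0."""
    if raw == "*":
        return 0.0
    return null_prob * 2.0 ** (int(raw) / INTSCALE)


def decode_xt(value):
    """Eight XT scores -> {state: {next_state: prob}}, null probability 1.0."""
    fields = value.split()
    if len(fields) != 8:
        raise ValueError("XT needs exactly eight fields")
    table = {}
    for (src, dst), raw in zip(XT_ORDER, fields):
        table.setdefault(src, {})[dst] = score_to_prob(raw, 1.0)
    return table


def decode_nule(value, alphabet_size):
    """NULE scores -> background probabilities, null probability 1/K."""
    return [score_to_prob(raw, 1.0 / alphabet_size) for raw in value.split()]


def evd_pvalue(bit_score, mu, lam):
    """Gumbel tail probability P(S >= bit_score) for location mu, scale lam."""
    exponent = -lam * (bit_score - mu)
    if exponent > 700.0:        # math.exp would overflow; P is 1.0 here
        return 1.0
    return -math.expm1(-math.exp(exponent))


xt = decode_xt("-8455 -4 -1000 -1000 -8455 -4 -8455 -4")
# Only the first seven NULE scores are shown in the abridged example.
background = decode_nule("595 -1558 85 338 -294 453 -1158", 20)
p = evd_pvalue(10.0, -49.999123, 0.271164)
```

In this example the first NULE score is positive, so its decoded probability comes out above 1/20, which is the "more common than chance" reading described for the NULE line. An E-value would then be a P-value scaled by the number of sequences searched; how HMMER combines this with its analytic upper bound is not covered by this sketch.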
All the remaining fields are mandatory.
The first line in the main model section is atypical; it contains
three fields, for transitions from the B state into the first node of
the model. The only purpose of this line is to set the B $\rightarrow$ D
transition probability. The first field is the score
for $1 - t(B \rightarrow D)$.
The second field is always ``*'' (there is no B $\rightarrow$ I
transition). The third field is the score for $t(B \rightarrow D)$.
The null probability used for converting these
scores back to probabilities is 1.0. In principle, only the third
number is needed to obtain $t(B \rightarrow D)$.
In practice, HMMER
reads both the first and the third number, converts them to
probabilities, and renormalizes the distribution to obtain
$t(B \rightarrow D)$.
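The read-and-renormalize step for this atypical line can be sketched as follows. This is a minimal Python illustration using the line from the rrm example; the function names are hypothetical and the code is not taken from hmmio.c.

```python
INTSCALE = 1000  # scaling constant, as stated for config.h


def score_to_prob(raw):
    """Score field -> probability with null probability 1.0; "*" is 0."""
    if raw == "*":
        return 0.0
    return 2.0 ** (int(raw) / INTSCALE)


def begin_to_delete_prob(line):
    """Probability of the B-to-D transition from the first main-section line."""
    first, second, third = line.split()
    if second != "*":
        raise ValueError("second field must be '*' (no B-to-I transition)")
    p_other = score_to_prob(first)    # complement of the B-to-D probability
    p_delete = score_to_prob(third)   # B-to-D probability as stored
    return p_delete / (p_other + p_delete)   # renormalize the pair


t_bd = begin_to_delete_prob("-21 * -6129")
```

For the example line the two stored probabilities already sum to nearly one, so renormalizing only corrects the small error introduced by integer rounding.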
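The remainder of the model has three lines per node, for M nodes
(where M is the number of match states, as given by the LENG
line). These three lines are:
- [Match emission line] The first field is the node number (1..M).
The remaining K fields are the match symbol emission scores, one per
symbol, in the alphabet order shown on the HMM line.
- [Insert emission line] The first field is the character of RF
annotation for this node, or ``-'' if there is none. The remaining K
fields are the insert symbol emission scores.
- [State transition line] The first field is the character of CS
annotation for this node, or ``-'' if there is none. The remaining
nine fields are the state transition scores, in the order given by the
column header line: m->m, m->i, m->d, i->m, i->i, d->m, d->d, b->m,
m->e.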
The insert emission and state transition lines for the final node M
are special. Node M has no insert state, so all the insert
emissions are given as ``*''. (In fact, this line is skipped by the
parser, except for its RF annotation.) There is also no next node, so
only the B $\rightarrow$ M and M $\rightarrow$ E transitions are
valid; the first seven transitions are always ``*''. (Incidentally,
the M $\rightarrow$ E transition score for the last node is always 0,
because this probability has to be 1.0.)
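A small Python sketch of reading a state transition line follows, using the column names from the header line of the example file. It assumes the line has ten whitespace-separated fields, the first being an annotation character or ``-'', and it keeps ``*'' as None instead of converting to a probability. The function name is made up for illustration.

```python
# Column names as they appear on the annotation line after the HMM line.
TRANSITION_NAMES = ["m->m", "m->i", "m->d", "i->m", "i->i",
                    "d->m", "d->d", "b->m", "m->e"]


def parse_transition_line(line):
    """Return (annotation, {column name: integer score or None for "*"})."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError("expected an annotation field plus nine scores")
    annotation = fields[0]
    scores = {name: (None if raw == "*" else int(raw))
              for name, raw in zip(TRANSITION_NAMES, fields[1:])}
    return annotation, scores


# Node 1 and the final node (72) from the rrm example.
_, first_node = parse_transition_line(
    "- -11 -11284 -12326 -894 -1115 -701 -1378 -21 *")
_, last_node = parse_transition_line("- * * * * * * * * 0")
```

For the final node of the example, every field except m->e is absent and m->e has score 0, matching the remark that this probability has to be 1.0.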
Finally, the last line of the format is the ``//'' record separator.
After the parser reads the file and converts the scores back to
probabilities, it renormalizes the probability distributions to sum to
1.0 to eliminate minor rounding/conversion/numerical imprecision
errors. If you're trying to emulate HMMER save files, it might be
useful to know what HMMER considers to be a probability
distribution; see Plan7Renormalize() in plan7.c for the relevant
function. The distributions are:
- [null emissions] The K symbol emissions
given on the NULE line.
- [null transitions] The two null model transitions
given on the NULT line.
- [N,E,C,J specials] Each of the four special states N,E,C,J has two
state transition probabilities (move and loop). All four distributions
are specified on the XT line.
- [B transitions] The M B $\rightarrow$ M entry probabilities are
given by the 9th field in the state
transition line of each of the M nodes. The B $\rightarrow$ D
transition (from the atypical first line of the main model section) is
also part of this state transition distribution.
- [match transitions] One distribution of 4 numbers per node;
M $\rightarrow$ M, M $\rightarrow$ I, M $\rightarrow$ D, and
M $\rightarrow$ E (fields 2,
3, 4, and 10 in the state transition line of each node). Note the
asymmetry between B $\rightarrow$ M and M $\rightarrow$ E; entries are
a probability distribution of their own, while exits are not.
- [insert transitions] One distribution of 2 numbers per node;
I $\rightarrow$ M, I $\rightarrow$ I (fields 5 and 6 of the state
transition line of each node).
- [delete transitions] One distribution of 2 numbers per
node; D $\rightarrow$ M, D $\rightarrow$ D (fields 7 and 8 of the
state transition line of each node).
- [match emissions] One distribution of K numbers
per node; the K match symbol emissions given on the first line of
each node in the main section.
- [insert emissions] One distribution of K numbers
per node; the K insert symbol emissions given on the second line of
each node in the main section.
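The per-node transition distributions listed above can be sketched in Python as follows. This is an illustrative sketch only: the grouping uses the field numbers given in the list, all scores are converted with a null probability of 1.0, and an all-zero group is simply left unchanged, which is a choice of this sketch and not necessarily what Plan7Renormalize() does. The B transitions are not handled here, since that distribution spans all nodes.

```python
INTSCALE = 1000  # scaling constant, as stated for config.h

# 1-based field numbers within a state transition line.
GROUPS = {"match": (2, 3, 4, 10), "insert": (5, 6), "delete": (7, 8)}


def score_to_prob(raw):
    """Score field -> probability with null probability 1.0; "*" is 0."""
    if raw == "*":
        return 0.0
    return 2.0 ** (int(raw) / INTSCALE)


def renormalize(probs):
    """Scale a vector so it sums to 1.0; leave an all-zero vector alone."""
    total = sum(probs)
    if total == 0.0:
        return list(probs)
    return [p / total for p in probs]


def node_distributions(transition_line):
    """Group one state transition line into renormalized distributions."""
    fields = transition_line.split()
    if len(fields) != 10:
        raise ValueError("expected ten fields on a state transition line")
    return {name: renormalize([score_to_prob(fields[i - 1]) for i in idx])
            for name, idx in GROUPS.items()}


# State transition line of node 1 from the rrm example.
dists = node_distributions("- -11 -11284 -12326 -894 -1115 -701 -1378 -21 *")
```

Anyone writing such a file by hand would need each of these groups to sum to one before converting to scores, which is the summation constraint mentioned earlier.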
Though I make an effort to keep this documentation up to date, it may
lag behind the code. For definitive answers, please check the parsing
code in hmmio.c. The relevant function to see what's being
written is WriteAscHMM(). The relevant function to see how it's
being parsed is read_asc20hmm().