modhmm is a software tool for building, training and scoring hidden Markov models. The software is open source and licensed under the GPL. A hidden Markov model is built using a modular approach: there is a set of predefined model subparts (modules) to choose from when putting together a complete hmm. The idea is to simplify the building of large hmms while still allowing hmms of arbitrary architecture to be built. The model building tool is called modhmmc.
modhmm consists of three main programs, modhmmc, modhmmt and modhmms, for creating, training and scoring hmms respectively. Training and scoring can be done using single sequences, multiple sequence alignments or sequence profiles as input.
The current state of modhmm is still somewhat preliminary.
The primary URL for this document is http://modhmm.sourceforge.net.
There are two different formats for storing modhmm models. The formats are identical in most respects; they differ only in whether multiple alphabets are used. The multiple alphabet format (up to four parallel alphabets are currently possible) includes some additional entries for this information which are not present in the single alphabet format. Every model created by modhmmc or modhmmt is saved in the .hmg text format. The format is fairly sensitive to minor changes such as extra blank lines or extra blanks within a line. Such changes may work, but nothing is guaranteed, so caution is advised when manually editing a .hmg file.
The header of a .hmg file contains 14 lines (+ 2 compulsory blank lines).
***********************Header*****************************
NAME: tutorialHMM
TIME OF CREATION: May 20, 2003 11:10:29 AM
ALPHABET: A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;
ALPHABET LENGTH: 20
NR OF MODULES: 93
NR OF VERTICES: 135
NR OF TRANSITIONS: 283
NR OF DISTRIBUTION GROUPS: 7
NR OF TRANSITION TIE GROUPS: 42
NR OF EMISSION PRIORFILES: 2
EMISSION PRIORFILES: ./amino_1.pri ./amino_2.pri
NR OF TRANSITION PRIORFILES: 0
TRANSITION PRIORFILES:
Description of headers
NAME: the name of the model
TIME OF CREATION: the time of the last modification of the model
ALPHABET: the alphabet of the emissions, each letter separated by ';'
ALPHABET LENGTH: the number of letters in the alphabet
NR OF MODULES: the number of hmm modules
NR OF VERTICES: the number of states
NR OF TRANSITIONS: the total number of transitions
NR OF DISTRIBUTION GROUPS: the number of distribution groups. A distribution group is a set of states whose emission probabilities have been tied together so that they are treated as the same state when parameters are updated during training.
NR OF TRANSITION TIE GROUPS: the number of tied transitions. A transition tie is the same thing as a distribution group, but for transitions.
NR OF EMISSION PRIORFILES: the number of emission priorfiles
EMISSION PRIORFILES: names (and paths) of emission priorfiles. An emission priorfile is a file with prior information over the emissions, used to weight the observed emission frequencies against a belief held prior to the observation.
NR OF TRANSITION PRIORFILES: the number of transition priorfiles
TRANSITION PRIORFILES: names (and paths) of transition priorfiles. Same thing as an emission priorfile, but for transitions.
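As an illustration of the header layout described above, the following minimal Python sketch reads the 'KEY: value' lines of a header into a dictionary and checks the alphabet against its declared length. It is not part of modhmm; the function name parse_hmg_header and the shortened sample text are assumptions made for this example only.

```python
def parse_hmg_header(text):
    """Collect the 'KEY: value' lines of a .hmg header into a dict.

    Hypothetical helper for illustration; modhmm itself ships no such function.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '****Header****' banner line.
        if not line or line.startswith("*"):
            continue
        # Split on the first ':' only, so times such as 11:10:29 stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Shortened copy of the header example shown above.
SAMPLE_HEADER = """\
***********************Header*****************************
NAME: tutorialHMM
TIME OF CREATION: May 20, 2003 11:10:29 AM
ALPHABET: A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;
ALPHABET LENGTH: 20
NR OF MODULES: 93
"""

header = parse_hmg_header(SAMPLE_HEADER)
# Every letter is followed by ';', so the final split piece is empty.
letters = [letter for letter in header["ALPHABET"].split(";") if letter]
alphabet_is_consistent = len(letters) == int(header["ALPHABET LENGTH"])
```

Such a check can be useful after editing a .hmg file by hand, since the format is sensitive to small changes.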
Each module section contains 4 rows in the beginning (+ a compulsory blank row) and a set of vertex sections, each separated by a blank row. Each module section is ended by a blank row and a row of '-'.
Module: module1
Type: Singlenode
NrVertices: 1
Emission prior file: ./amino.pri
Transition prior file: null

Vertex 70:
Vertex type: standard
Vertex label: M
Transition prior scaler: 1.0
. . .
-------------------------------------------------------
Description of module variables
Module: the name of the module
Type: the type of the module (see Section 3.3.1.2, “Modules”)
NrVertices: the number of states in this module
Emission prior file: possible emission prior file associated with this module; 'null' means that no file is associated
Transition prior file: possible transition prior file associated with this module; 'null' means that no file is associated
Each vertex section consists of 9 initial rows, followed by sections for transition probabilities, end transition probabilities and emission probabilities respectively.
Vertex 1:
Vertex type: standard
Vertex label: d
Transition prior scaler: 1.0
Emission prior scaler: 1.0
Nr transitions = 1
Nr end transitions = 0
Nr emissions = 20
Transition probabilities
Vertex 2: 1.0
End transition probabilities
Emission probabilities
A: 0.05
C: 0.05
. . .
T: 0.05
V: 0.05
W: 0.05
Y: 0.05
Vertex: the number of the state
Vertex type: the type of the state. Three types exist: standard, silent and locked. A standard state is a regular emitting state. A silent state is a state that does not emit any symbols. Finally, a locked state is an emitting state for which the emission probabilities are fixed, i.e. they will not be updated during training.
Transition prior scaler: a factor which describes how much weight to put on the possible prior distribution associated with this vertex
Emission prior scaler: a factor which describes how much weight to put on the possible Dirichlet prior mixture associated with this vertex
Nr transitions: the number of transitions from this state (except to end states)
Nr end transitions: the number of transitions from this state to end states
Nr emissions: the number of different possible emissions from this state, usually equal to the alphabet size
Transition probabilities: one row for each transition, giving the state the transition goes to and the probability of the transition
Transition probabilities
Vertex 2: 0.4
Vertex 3: 0.6
End transition probabilities: same as above, but for end transitions. Usually there is at most one transition of this type, but nothing in the program prohibits more.
End transition probabilities
Vertex 4: 1.0
Emission probabilities: the probabilities for emitting the different letters in the alphabet, one row for each letter in the alphabet. In the case of continuous emissions, the alphabet is interpreted by the program as follows. The letters are divided into groups of 3. Each group describes one component of a mixture of one-dimensional normal distributions. For each group, the first letter represents the mean value, the second the variance and the third the coefficient for the particular mixture component.
Emission probabilities
A: 0.05
C: 0.05
D: 0.05
E: 0.05
. . .
Y: 0.05
Continuous (2 mixture components):
Emission probabilities
m1: 22.5
var1: 0.34
co1: 0.45
m2: -22.3
var2: 3.11
co2: 0.55
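To make the continuous case concrete, the following Python sketch groups a flat list of values into (mean, variance, coefficient) triples and evaluates the resulting mixture of one-dimensional normal densities. It only illustrates the grouping rule described above; the function names are assumptions, and modhmm's internal evaluation may differ in detail.

```python
import math


def group_triples(values):
    """Split a flat list into (mean, variance, coefficient) triples."""
    if len(values) % 3 != 0:
        raise ValueError("continuous emissions come in groups of 3 values")
    return [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]


def mixture_density(x, components):
    """Density at x of a mixture of one-dimensional normal distributions."""
    total = 0.0
    for mean, variance, coefficient in components:
        norm = 1.0 / math.sqrt(2.0 * math.pi * variance)
        total += coefficient * norm * math.exp(-((x - mean) ** 2) / (2.0 * variance))
    return total


# Values taken from the two-component example above: m1, var1, co1, m2, var2, co2.
components = group_triples([22.5, 0.34, 0.45, -22.3, 3.11, 0.55])
density_at_first_mean = mixture_density(22.5, components)
```

In this example the two coefficients sum to 1, so the mixture is itself a probability density.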
The emission distribution group has a line for each distribution group which simply states the numbers of the vertices that belong to a particular group
Group 1: 1 2
Group 2: 3 4
Download the software from the SourceForge project page. The latest version of modhmm is 1.1.0.
To install modhmm on Windows, first download modhmm-1.1.0-win32.exe and then execute the file (click on it).
To install modhmm on Ubuntu or Debian, first download modhmm-1.1.0.deb, then log in as root and run
# dpkg -i modhmm-1.1.0.deb
To install modhmm on CentOS or Fedora, first download modhmm-1.1.0.Linux.rpm, then log in as root and run
# yum localinstall modhmm-1.1.0.Linux.rpm
To install modhmm on Mac OS X v10.5 (Leopard) on a Mac computer with an Intel cpu, first download modhmm-1.1.0-MacOSX10.5.tar.gz and then unpack it
$ tar xfz modhmm-1.1.0-MacOSX10.5.tar.gz
To install modhmm on Mac OS X v10.4 (Tiger) on a Mac computer with an Intel cpu, first download modhmm-1.1.0-MacOSX10.4.tar.gz and then unpack it
$ tar xfz modhmm-1.1.0-MacOSX10.4.tar.gz
To build modhmm on Unix ( e.g. Linux, MacOSX, CygWin ) you need to have this installed
If you have the modhmm source code in the directory /tmp/source and you want to install modhmm into the directory /tmp/install, first run cmake, then make, and then make install:
$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/tmp/install /tmp/source && make && make install
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/build
Scanning dependencies of target fastdist
[ 3%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist.o
[ 6%] Building CXX object src/c++/CMakeFiles/fastdist.dir/BitVector.o
[ 9%] Building CXX object src/c++/CMakeFiles/fastdist.dir/Exception.o
[ 12%] Building CXX object src/c++/CMakeFiles/fastdist.dir/InitAndPrintOn_utils.o
[ 15%] Building CXX object src/c++/CMakeFiles/fastdist.dir/Object.o
[ 18%] Building CXX object src/c++/CMakeFiles/fastdist.dir/Sequence.o
[ 21%] Building CXX object src/c++/CMakeFiles/fastdist.dir/SequenceTree.o
[ 25%] Building CXX object src/c++/CMakeFiles/fastdist.dir/SequenceTree_MostParsimonious.o
[ 28%] Building CXX object src/c++/CMakeFiles/fastdist.dir/Simulator.o
[ 31%] Building CXX object src/c++/CMakeFiles/fastdist.dir/arg_utils_ext.o
[ 34%] Building CXX object src/c++/CMakeFiles/fastdist.dir/file_utils.o
[ 37%] Building CXX object src/c++/CMakeFiles/fastdist.dir/stl_utils.o
[ 40%] Building CXX object src/c++/CMakeFiles/fastdist.dir/DNA_b128/DNA_b128_String.o
[ 43%] Building CXX object src/c++/CMakeFiles/fastdist.dir/DNA_b128/Sequences2DistanceMatrix.o
[ 46%] Building CXX object src/c++/CMakeFiles/fastdist.dir/aml/AML_LeafLifting.o
[ 50%] Building CXX object src/c++/CMakeFiles/fastdist.dir/aml/AML_given_edge_probabilities.o
[ 53%] Building CXX object src/c++/CMakeFiles/fastdist.dir/aml/AML_local_improve.o
[ 56%] Building CXX object src/c++/CMakeFiles/fastdist.dir/aml/AML_star.o
[ 59%] Building CXX object src/c++/CMakeFiles/fastdist.dir/aml/Big_AML.o
[ 62%] Building CXX object src/c++/CMakeFiles/fastdist.dir/distance_methods/LeastSquaresFit.o
[ 65%] Building CXX object src/c++/CMakeFiles/fastdist.dir/distance_methods/NeighborJoining.o
[ 68%] Building CXX object src/c++/CMakeFiles/fastdist.dir/sequence_likelihood/Kimura2parameter.o
[ 71%] Building CXX object src/c++/CMakeFiles/fastdist.dir/sequence_likelihood/TamuraNei.o
[ 75%] Building CXX object src/c++/CMakeFiles/fastdist.dir/sequence_likelihood/ambiguity_nucleotide.o
[ 78%] Building CXX object src/c++/CMakeFiles/fastdist.dir/sequence_likelihood/dna_pairwise_sequence_likelihood.o
[ 81%] Building CXX object src/c++/CMakeFiles/fastdist.dir/sequence_likelihood/string_compare.o
[ 84%] Building CXX object src/c++/CMakeFiles/fastdist.dir/DistanceMatrix.o
[ 87%] Building C object src/c++/CMakeFiles/fastdist.dir/arg_utils.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 90%] Building C object src/c++/CMakeFiles/fastdist.dir/std_c_utils.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 93%] Building C object src/c++/CMakeFiles/fastdist.dir/DNA_b128/sse2_wrapper.o
[ 96%] Building CXX object src/c++/CMakeFiles/fastdist.dir/DNA_b128/computeTAMURANEIDistance_DNA_b128_String.o
[100%] Building CXX object src/c++/CMakeFiles/fastdist.dir/DNA_b128/computeDistance_DNA_b128_String.o
Linking CXX executable fastdist
[100%] Built target fastdist
[100%] Built target fastdist
Linking CXX executable CMakeFiles/CMakeRelink.dir/fastdist
Install the project...
-- Install configuration: ""
-- Install configuration: ""
-- Installing /tmp/install/bin/fastdist
-- Install configuration: ""
If you want to build the html documentation ( i.e. this page ) you need to pass the -DBUILD_DOCBOOK=ON option to cmake.
This section is mainly intended for package maintainers.
To build the modhmm nullsoft installer package ( modhmm-1.1.0-win32.exe ) you need to have this installed
on your Windows machine.
Open an msys bash shell and run
$ mkdir tmpbuild
$ cd tmpbuild
$ cmake path/to/the/modhmm/source/code -DSTATIC=ON -G "MSYS Makefiles" && make win32installer
The source code for gengetopt, libz and libxml will be automatically downloaded and built statically.
On a CentOS or Fedora machine, first log in as root and install the dependencies
# yum install xmlto libxml2-devel cmake gcc-c++ binutils gengetopt
Check that cmake is version 2.6 or later
$ cmake --version
cmake version 2.6-patch 0
If it is older, you can download a cmake binary directly from www.cmake.org. Now build the rpm package.
$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/ -DBUILD_DOCBOOK=ON /tmp/source && make package
On a Debian or Ubuntu machine, first log in as root and install the dependencies
# apt-get install libxml2-dev cmake g++ binutils gengetopt
Check that cmake is version 2.6 or later
$ cmake --version
cmake version 2.6-patch 0
If it is older, you can download a cmake binary directly from www.cmake.org. Now build the deb package.
$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/ -DBUILD_DOCBOOK=ON /tmp/source && make package
To build the modhmm install package for MacOS X you need to have this installed
on your MacOS X computer.
Check that cmake is version 2.6 or later
$ cmake --version
cmake version 2.6-patch 0
$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DSTATIC=ON -DCPACK_GENERATOR="TGZ" /tmp/source && make package
modhmmc is the program for designing an HMM. The main modhmmc program has a command line based interactive user interface which lets the user specify the alphabet, states, transition and emission probabilities, etc. If an hmm with one alphabet is designed, it will automatically be saved in the one-alphabet format. If an hmm with multiple alphabets is designed, it will automatically be saved in the multiple alphabet format.
Type modhmmc --help to see the command line options.
The states of an HMM are specified as a collection of state modules with transitions between them. A module is a set of states which are interconnected in a predecided fashion. The idea behind modules is to make the creation of large HMMs easier. HMMs with several hundred states are not uncommon in sequence analysis, and specifying each and every state and transition in such a case is impractical. modhmmc currently has 7 module types to choose from: singlenode, singleloop, forward std, forward alt, cluster, profile7 and profile9. modhmmc allows for the creation of both regular and silent states.
Transition probabilities are set by default inside the modules to correspond to the intrinsic properties of that module (see descriptions of the modules). Transition probabilities between modules are by default set uniformly, so that the probabilities of going from a module to another are equally distributed among all its neighbours. Emission probabilities for each state are either set manually by the user or according to a chosen distribution. The choices are to set the values uniformly, randomly or to zero for all letters, which creates a silent state. There are also some special distributions that are particular to certain states.
The singlenode module is the most basic module of modhmmc. It consists of only one state, emitting by default, but the user may specify it as silent. All other modules may be built using collections of singlenodes. The singlenode has no input parameters.
The singleloop module consists of one state with a transition to itself. It is necessarily emitting, since no loops of silent states are allowed. The singleloop has one input parameter: the user specifies the expected length of the loop, which determines the loop transition probability.
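The relation between expected length and loop probability is not spelled out here. As a hedged illustration only, the Python sketch below uses the standard geometric-duration relation for a self-looping state: with self-transition probability p, the expected number of symbols emitted is 1/(1-p), so p = 1 - 1/L for an expected length L. Whether modhmmc uses exactly this formula is an assumption, and the function names are hypothetical.

```python
def loop_probability(expected_length):
    """Self-transition probability giving the requested expected length.

    Assumes geometrically distributed state duration; this is a textbook
    relation, not a formula confirmed for modhmmc.
    """
    if expected_length < 1:
        raise ValueError("expected length must be at least 1")
    return 1.0 - 1.0 / expected_length


def expected_length(loop_prob):
    """Expected number of emitted symbols for a given self-transition probability."""
    if not 0.0 <= loop_prob < 1.0:
        raise ValueError("loop probability must be in [0, 1)")
    return 1.0 / (1.0 - loop_prob)


p = loop_probability(10)  # a loop expected to emit 10 symbols
```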
The forward module is a set of states connected in a straight line, indexed 1 to n. All states are emitting by default. The two input parameters (m,n) specify the shortest and longest possible routes through the module. The number of states in the module is equal to the length of the longest possible route n. For the state with index m-1 there is a transition to all states in the module with a higher index number up to and including the state with index n. The transition probabilities are by default set so that the total probability is equal for all possible paths through the module, but it is also possible to set the length probabilities according to a binomial distribution, with the base either in the shortest or the longest path. Any state that connects to this module will connect to the state with index 1, and all outgoing connections from this module go from the state with index n.
The forward module is a set of states connected in a straight line, indexed 1 to n. All states are emitting by default. The two input parameters (m,n) specify the shortest and longest possible routes through the module. The number of states in the module is equal to the length of the longest possible route n. For all states with index m-1 to n-2 there is a transition to the last state. The transition probabilities are set so that the total probability is equal for all possible paths through the module, but it is also possible to set the length probabilities according to a binomial distribution, with the base either in the shortest or the longest path. Any state that connects to this module will connect to the state with index 1, and all outgoing connections from this module go from the state with index n.
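The two forward-module layouts described above can be checked with a small Python sketch that builds the transitions and enumerates the lengths of all routes from state 1 to state n. The variant names "first" and "second" simply refer to the order of the two descriptions; they, the function names and the restriction 2 <= m <= n are assumptions of this sketch, and transition probabilities are not modelled.

```python
def forward_transitions(m, n, variant):
    """Return {state: set of target states} for a forward module with states 1..n."""
    if not 2 <= m <= n:
        raise ValueError("this sketch assumes 2 <= m <= n")
    # Straight line: every state except the last connects to its successor.
    edges = {i: ({i + 1} if i < n else set()) for i in range(1, n + 1)}
    if variant == "first":
        # State m-1 connects to every state with a higher index, up to n.
        edges[m - 1] |= set(range(m, n + 1))
    elif variant == "second":
        # States m-1 .. n-2 each get an extra transition to the last state.
        for i in range(m - 1, n - 1):
            edges[i].add(n)
    else:
        raise ValueError("variant must be 'first' or 'second'")
    return edges


def route_lengths(edges, n, state=1, length=1):
    """Set of lengths (counted in states) of all routes from `state` to state n."""
    if state == n:
        return {length}
    lengths = set()
    for target in edges[state]:
        lengths |= route_lengths(edges, n, target, length + 1)
    return lengths


first = route_lengths(forward_transitions(3, 6, "first"), 6)
second = route_lengths(forward_transitions(3, 6, "second"), 6)
```

For (m,n) = (3,6) both layouts allow routes of 3, 4, 5 and 6 states, which matches the stated shortest and longest routes.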
The cluster module is a fully interconnected set of states. Every state has a transition to every other state. The transition probabilities are evenly distributed by default. All states are emitting. The input parameter is the number of states. Incoming transitions connect to all states, and outgoing connections go from all states.
The profile9 module is equal to the standard profile-HMM architecture. The input parameter specifies the length of the module, i.e. the number of match states. Incoming transitions connect to an initial silent state. From this state there are transitions to a pre-model insert state and to the first match and delete states of the model. For the purpose of local alignment with respect to the model there are also transitions to the following match states. These may be set to zero when global alignment is preferred. From the last match and delete states there are transitions to a silent state at the end. There are also transitions from the previous match states to this state for the purpose of local alignments. These are set to zero when global alignment is used. This silent end-state has a transition to an insert state for the latter unaligned part of a sequence and to the next module. Transition parameters are set automatically to default values at this stage. For a profile HMM, they may be updated using the opt_prfhmm program.
The profile7 module is equal to the profile9 module in every way, except for it not having any delete ⟹ insert or insert ⟹ delete transitions.
The U-turn module is a set of states that models a symbol sequence of arbitrary length. The "bottom of the U" is represented by a single node with a transition to itself, while the states in the leg leading to the bottom of the U have transitions both to the next state towards the bottom and to two states on the other leg of the U. The states in the leg leading out of the U have a transition to the next state on the path out from the U. At initialization all transition probabilities are set to 0.5. The length parameter defines the number of states in each leg.
Specify file name of HMM. modhmmc will add '.hmg'
Specify number of alphabets
The alphabet of an HMM is specified as a set of letters, where each letter is a word of up to 4 characters.
If multiple alphabets exist, these must also be specified
...
...
Specify name of start state
Specify module type, name and label
Specify name of end state
Specify transitions from state a
Tie the emission probabilities of (all states of) the specified modules to each other. This means that when updating the value of the emission probabilities during training, all states in a distribution group will be treated as one state.
Tie the transition probabilities of the specified modules to each other. Modules must be identical (same type and size) for this to work. Tying two modules means that transitions in the two modules which correspond to each other will be updated as one transition when transition probabilities are updated during training.
Initialize the emission probabilities and tie the states to specific prior distribution files. Also specify the weight of the prior against the observations; the default is 1.0. This is done for each alphabet.
Same as above, but for transitions. (prior files and weights not implemented in training and scoring algorithms yet).
The modhmmt program is used for parameter optimization. It implements the regular Baum-Welch training algorithm as well as conditional maximum likelihood (CML) training, for single sequences, multiple sequence alignments and sequence profiles.
Type modhmmt --help to see the command line options
[user@saturn ~]$ modhmmt --help
modhmmt 1.1.0
train a hidden markov model
Usage: modhmmt [OPTIONS]...
  -h, --help  Print help and exit
  -V, --version  Print version and exit
  -i, --hmminfile=filename  modelfile (in .hmg format)
  -s, --seqnamefile=filename  sequence namefile (for sequences in fasta, smod, msamod or prfmod format)
  -f, --seqformat=STRING  format of input sequences (fa=fasta, s=smod, msa=msamod, prf=prfmod)
  -o, --outfile=filename  model outfile
  -q, --freqfile=filename  background frequency file
  -x, --smxfile=filename  substitution matrix file
  -r, --replfile=filename  replacement letter file
  -a, --alg=STRING  training algorithm (cml=conditional maximum likelihood, bw=baum-welch (default), disc=discriminative training)
  -n, --negseqnamefile=filename  sequence namefile for negative training sequences (for sequences in fasta, smod, msamod or prfmod format)
  -z, --optalpha=STRING  alphabet to optimize (parameters for transitions and all other alphabets will not be changed
  -M, --msascoring=STRING  scoring method for alignment and profile data options = DP/DPPI/GM/GMR/DPPI/PI/PIS default=GM
  -c, --usecolumns=STRING  specify which columns to use for alignment input data, options = all/nr, where all means use all columns and nr specifies a sequence in the alignment and the columns where this sequence have non-gap symbls are used default = all
      --nolabels  do not use labels even though the input sequences are labeled (default=off)
      --noprior  do not use priors when training even though the the model file has prior files specified (default=off)
      --tpcounts  use pseudocounts for transition parameter updates (default=off)
      --epcounts  use pseudocounts for emission parameter updates (default=off)
      --transonly  only update transition parameters (default=off)
      --emissonly  only update emission parameters (default=off)
  -v, --verbose  print some information about what is going on (default=off)
The modhmms program is used for scoring sequences, multiple sequence alignments and sequence profiles against an hmm. The algorithms implemented include forward, Viterbi and 1-best. Output can be either a log-likelihood/log-odds/reverse score, the (approximately) most probable labeling of a sequence or the most probable state path.
Type modhmms --help to see the command line options
[user@saturn ~]$ modhmms --help
modhmms 1.1.0
score sequences on hidden markov models
Usage: modhmms [OPTIONS]...
  -h, --help  Print help and exit
  -V, --version  Print version and exit
  -m, --hmmnamefile=filename  model namefile for models in hmg format
  -s, --seqnamefile=filename  sequence namefile (for seuences in fasta, smod, msamod or prfmod format)
  -f, --seqformat=STRING  format of input sequences (fa=fasta, s=smod, msa=msamod, prf=prfmod)
  -o, --outpath=dir  output directory
  -q, --freqfile=filename  background frequency file
  -x, --smxfile=filename  substitution matrix file
  -r, --replfile=filename  replacement letter file
  -p, --priorfile=filename  sequence prior file (for msa input files)
  -n, --nullfile=filename  null model file
  -a, --anchor=STRING  hmm=results are hmm-ancored (default), seq=results are sequence anchored
  -L, --labeloutput  output will print predicted labeling and posterior label probabilities (default=off)
  -A, --alignmentoutput  output will print log likelihood, log odds and reversi scores (default=off)
  -M, --msascoring=STRING  scoring method for alignment and profile data options = DP/DPPI/GM/GMR/DPPI/PI/PIS default=GM
  -c, --usecolumns=STRING  specify which columns to use for alignment input data, options = all/nr, where all means use all columns and nr specifies a sequence in the alignment and the columns where this sequence have non-gap symbls are used default = all
      --nolabels  do not use labels even though the input sequences are labeled (default=off)
  -v, --verbose  print some information about what is going on (default=off)
Group: score_algs
      --viterbi  Use viterbi algorithm for alignment and/or label scoring (default no)
      --nbest  Use n-best (=1-best) algorithm for label scoring (default yes)
      --forward  Use forward algorithm for alignment scoring (default yes)
      --max_d  Retrain model on each sequence using Baum-Welch before scoring (default=off)
      --savehmm  Save retrained HMM to file (default=off)
options for specific output control:
      --path  Print most likely statepath (default=off)
      --nopostout  no posterior probability information for label scoring (default=off)
      --nolabelout  no predicted labeling for label scoring (default=off)
      --nollout  no log likelihood score for alignment scoring (default=off)
      --nooddsout  no log odds score for alignment scoring (default=off)
      --norevout  no reversi score for alignment scoring (default=off)
      --alignpostout  print posterior probability information for alignment scoring (default=off)
      --alignlabelout  print predicted labeling for alignment scoring (default=off)
      --labelllout  print log likelihood score for label scoring (default=off)
      --labeloddsout  print log odds score for label scoring (default=off)
      --labelrevout  print reversi score for label scoring (default=off)
The modseqalign program aligns two sequences/multiple alignments/profiles using their most probable state paths through a given hmm.
Type modseqalign --help to see the command line options
[user@saturn ~]$ modseqalign --help
modseqalign 1.1.0
align 2 sequences to each other using their most likely state path through a given HMM
Usage: modseqalign [OPTIONS]...
  -h, --help  Print help and exit
  -V, --version  Print version and exit
  -m, --hmmfile=filename  model namefile for models in hmg format
  -s, --target=filename  sequence namefile (for seuences in fasta, smod, msamod or prfmod format)
  -t, --template=filename  sequence namefile (for seuences in fasta, smod, msamod or prfmod format)
  -f, --seqformat=STRING  format of input sequences (fa=fasta, s=smod, msa=msamod, prf=prfmod)
  -o, --outfile=filename  model outfile
  -q, --freqfile=filename  background frequency file
  -x, --smxfile=filename  substitution matrix file
  -r, --replfile=filename  replacement letter file
  -p, --priorfile=filename  sequence prior file (for msa input files)
  -M, --msascoring=STRING  scoring method for alignment and profile data options = DP/DPPI/GM/GMR/DPPI/PI/PIS default=GM
  -c, --usecolumns=STRING  specify which columns to use for alignment input data, options = all/nr, where all means use all columns and nr specifies a sequence in the alignment and the columns where this sequence have non-gap symbls are used default = all
      --nolabels  do not use labels even though the input sequences are labeled (default=off)
  -v, --verbose  print some information about what is going on (default=off)
The add_alphabet program adds an alphabet with predefined emission probabilities to a given model file.
Type add_alphabet --help to see the command line options
[user@saturn ~]$ add_alphabet --help
add_alphabet 1.1.0
add alphabet with predefined emission probabilities to a given model file
Usage: add_alphabet [OPTIONS]...
  -h, --help  Print help and exit
  -V, --version  Print version and exit
  -i, --hmminfile=filename  modelfile (in .hmg format)
  -o, --outfile=filename  model outfile
  -a, --alphafile=filename  alphabet file
  -v, --verbose  print some information about what is going on (default=off)
There are a few different types of input files associated with the modhmm package, mainly files for sequences and various utility files. There are 4 different sequence file formats (fasta, single, multi and profile). Files in the profile format must contain labels, while for the other formats labels are optional. Each sequence file may contain only one sequence, but a sequence may consist of up to 4 different parallel alphabets.
The standard fasta sequence file format. Fasta sequences can only consist of one alphabet.
Labels can also be used in this format ( see Section 3.4.1.4, “Labels” ).
The Modhmm s discrete format is a single sequence file format specific to modhmm, designed to allow alphabets whose letters have more than one character. '<' is the signal for starting a sequence, '>' is the signal for ending it and ';' is the end-of-letter marker. Note that this marker must also be present after the last letter in a sequence, i.e. right before the '>' sign. A sequence may occupy more than one line and can be of arbitrary length. Anything written outside of a <> marking will be ignored.
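The rules above can be summarised in a short Python sketch that extracts the letters of a single-alphabet sequence. It is an illustrative reader written for this manual, not modhmm's own parser; the function name parse_s_discrete is an assumption, and the multiple alphabet variant is not handled.

```python
import re


def parse_s_discrete(text):
    """Return one list of letters per <...> block of a single-alphabet file.

    Text outside the '<' and '>' markers is ignored, and every letter,
    including the last one, must be terminated by ';'.
    """
    sequences = []
    for body in re.findall(r"<([^<>]*)>", text):
        pieces = body.split(";")
        # After the final ';' only whitespace may remain before '>'.
        if pieces[-1].strip() != "":
            raise ValueError("missing ';' after the last letter")
        # strip() also removes line breaks inside a multi-line sequence.
        sequences.append([piece.strip() for piece in pieces[:-1]])
    return sequences


parsed = parse_s_discrete("ignored text <AA;B;\nCdef;> more ignored text")
```

Because letters are delimited by ';', multi-character letters such as AA or Cdef are read as single symbols.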
The multiple alphabet version of this format has the same basic structure as the single alphabet version. Each sequence consists of 2-4 subsequences which begin and end with '<' and '>' respectively. The number of '<' characters in the beginning is used to indicate which alphabet a subsequence belongs to. The subsequences must be in numerical order, and they must be of equal lengths.
Example 3. seq-multiple-alphabet.modhmm-s-discrete, a multiple-alphabet example file in the Modhmm s discrete format
The example file seq-multiple-alphabet.modhmm-s-discrete contains
<A;B;C;D;E;F;G;H;>
<<a;b;c;d;e;f;g;h;>
Labels can also be used in this format ( see Section 3.4.1.4, “Labels” ).
The Modhmm msa discrete format is the same format as the Modhmm s discrete format ( Section 3.4.1.2, “Modhmm s discrete format” ) but for multiple sequence alignments. This means that one msa sequence file contains a set of sequences aligned to each other. For this purpose, each sequence must occupy exactly one line. '-', '_', ' ' and '.' can all be used to signify a gap. All sequences in an msa sequence file are regarded as belonging to the same multiple sequence alignment. Each line must be as long as or shorter than the line containing the template sequence when such a sequence is used. The starting position of an alignment is calculated from the first letter after '<', which means that it is not possible to skip letters at the beginning of a sequence. These must be represented as gaps. However, it is possible to skip letters at the end of a sequence.
Example 4. seq.modhmm-msa-discrete, an example file in Modhmm msa discrete format
The example file seq.modhmm-msa-discrete contains
<A;B;C;D;E;F;G;H;>
<A;-;C;D;E;P;G;H;>
<A;B;-;-;E; ; ; ;>
<A;B;C;D;-;F>
It uses a single alphabet.
This however will not work:
<A;B;C;D;E;F;G;H;>
<A;-;C;D;E;P;G;H;>
<A;B;-;-;E; ; ; ;>
<C;D;-;F>
'C' in the fourth line will be aligned with the 'A'-column, 'D' with the 'B'-column and so on.
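The column rule can be reproduced with a small Python sketch that reads each line of an msa file into a row, treats all four gap characters alike and pads short rows with gaps at the end. The function name and the padding behaviour are assumptions made for illustration; the sketch handles a single alphabet only.

```python
GAP_CHARACTERS = {"-", "_", " ", "."}


def parse_msa_rows(text):
    """Return the alignment as rows of letters, with every gap written as '-'."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("<") and line.endswith(">")):
            continue  # anything outside a <> marking is ignored
        letters = line[1:-1].split(";")
        if letters and letters[-1] == "":
            letters.pop()  # empty piece after the final ';'
        rows.append(["-" if letter in GAP_CHARACTERS else letter for letter in letters])
    if not rows:
        return []
    # Positions are counted from the first letter after '<', so shorter
    # rows can only be padded with gaps at the end.
    width = max(len(row) for row in rows)
    return [row + ["-"] * (width - len(row)) for row in rows]


GOOD = "<A;B;C;D;E;F;G;H;>\n<A;-;C;D;E;P;G;H;>\n<A;B;-;-;E; ; ; ;>\n<A;B;C;D;-;F>"
BAD = "<A;B;C;D;E;F;G;H;>\n<A;-;C;D;E;P;G;H;>\n<A;B;-;-;E; ; ; ;>\n<C;D;-;F>"

good_columns = list(zip(*parse_msa_rows(GOOD)))
bad_columns = list(zip(*parse_msa_rows(BAD)))
```

In the second file the first column becomes ('A', 'A', 'A', 'C'), which shows the misalignment described above.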
Example 5. seq-multiple-alphabet.modhmm-msa-discrete, an example file in the Modhmm msa discrete format
The example file seq-multiple-alphabet.modhmm-msa-discrete contains
<A;B;C;D;E;F;G;H;>
<A;-;C;D;E;P;G;H;>
<A;B;-;-;E; ; ; ;>
<<a;b;c;d;d;f;g;h;>
<<a;-;c;d;r;f;g;h;>
<<a;b;-;-;e; ; ; ;>
It uses multiple alphabets.
Labels can also be used in this format ( see Section 3.4.1.4, “Labels” ).
A line containing a labeling of the sequence may be added to sequences in fa, s or msa format. The labeling is represented as a line at the end of the file (enclosed by '/' characters) with one label character for each sequence position. For a fasta sequence:
>seqname1
ACDEFGHIKLMNPQRSTVWY
/..................../
For an s sequence:
>seqname1
<A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;>
/..................../
For an msa sequence:
>seqname1
<A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;>
<A;C;D;T;F;T;H;I;K;-;-;N;P;Q;R;S;T;-;W;Y;>
<-;-;-;-;-;-;-;-;-;-;-;N;P;Q;R;S;T;-;Y;Y;>
/..................../
A label must only consist of one character. The character '.' is predefined as representing any label, which means that a sequence letter labeled with '.' will match any state label. For multiple alphabets the principle is the same as for unlabeled alignment sequences, and the labels are placed at the bottom (only once, since the labeling is the same over all alphabets).
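As an illustration of the labeling rules, here is a small sketch. The helper names are hypothetical and not part of modhmm; the sketch only assumes what is stated above: the label line is enclosed by '/' characters, holds one character per position, and '.' matches any state label.

```python
def parse_label_line(line):
    """Return the per-position labels of a '/..../' line as a list of characters."""
    text = line.rstrip("\n")
    if len(text) < 2 or text[0] != "/" or text[-1] != "/":
        raise ValueError("a label line must be enclosed by '/' characters")
    return list(text[1:-1])


def label_matches(sequence_label, state_label):
    """'.' is the predefined wildcard: it is compatible with every state label."""
    return sequence_label == "." or sequence_label == state_label


labels = parse_label_line("/" + "." * 20 + "/")
print(len(labels))  # 20, one label per position of a 20-letter sequence
```

A caller would typically also compare `len(labels)` with the sequence length before training or scoring.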
The Modhmm s continuous format is the continuous version of the Modhmm s discrete format. All lines start with '#' and end with '+' to mark that the alphabet is continuous. Otherwise, it does not differ from the Modhmm s discrete format.
The multiple alphabet version of this format has the same basic structure as the single alphabet version. Each sequence consists of 2-4 subsequences which begin and end with '#' and '+' respectively. The number of '#' characters in the beginning is used to indicate which alphabet a subsequence belongs to. The subsequences must be in numerical order, and they must be of equal lengths. It is also possible to mix discrete and continuous alphabets. This is simply done by using the indicators '#' and '+', and '<' and '>', respectively.
Example 7. seq-multiple-alphabet.modhmm-s-continuous, an example file in the Modhmm s continuous format
The example file seq-multiple-alphabet.modhmm-s-continuous contains
<A;W;A;A;>
##0.11;2.99;-1.00;-10.09;+
Labeling sequences from a continuous alphabet works in exactly the same way as for sequences from a discrete alphabet.
Alignments of sequences from continuous alphabets do not exist. However, when using multiple alphabets, single sequences from continuous alphabets may be used together with alignments of ones from discrete alphabets. Here, 'X' is a wildcard letter which stands for anything (emission probability 1.0).
Example 8. seq.modhmm-msa-continuous, an example file in Modhmm msa continuous format
The example file seq.modhmm-msa-continuous contains
<A;B;C;D;E;F;G;H;>
<A;-;C;D;E;P;G;H;>
<A;B;-;-;E; ; ; ;>
<A;B;C;D;-;F>
<<s;s;h;h;f;f;s;s>
<<h;-;h;h;h;h;h;h;>
<<s;s;-;-;q; ; ; ;>
<<h;h;h;h;-;h>
###0.01;0.05;0.99;0.88;X;0.11;-1.99;-2.99;+
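The alphabet markers used in these formats can be decoded line by line. The sketch below is illustrative only: `parse_subsequence` is a hypothetical helper, not a modhmm function. It assumes that the number of leading '<' or '#' characters gives the alphabet number, and that 'X' in a continuous subsequence is kept as the wildcard rather than converted to a number.

```python
def parse_subsequence(line):
    """Return (alphabet number, kind, items) for one '<...>' or '#...+' line."""
    text = line.strip()
    if text.startswith("<"):
        marker, closer, kind = "<", ">", "discrete"
    elif text.startswith("#"):
        marker, closer, kind = "#", "+", "continuous"
    else:
        raise ValueError("line must start with '<' or '#'")
    if not text.endswith(closer):
        raise ValueError("line must end with %r" % closer)
    # The count of leading marker characters selects the alphabet (1 to 4).
    alphabet = len(text) - len(text.lstrip(marker))
    body = text[alphabet:-1]
    if body.endswith(";"):
        body = body[:-1]
    items = body.split(";")
    if kind == "continuous":
        # 'X' is the wildcard (emission probability 1.0); keep it symbolic.
        items = [item if item == "X" else float(item) for item in items]
    return alphabet, kind, items


print(parse_subsequence("<A;W;A;A;>"))
# (1, 'discrete', ['A', 'W', 'A', 'A'])
```

Checking that the alphabet numbers appear in increasing order across a file is left to the caller.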
The Modhmm prf format is slightly different from the other formats, as it does not contain letters but rather frequencies at each position. Labels are compulsory for this sequence format.
Example 9. seq.modhmm-prf, an example file in Modhmm msa prf format
The example file seq.modhmm-prf contains
Sequence: 1c3wA
NR of aligned sequences: 4
Length of query sequence: 5
START 1
ALPHABET: A C D E - <SPACE> <LABEL> <QUERY>
COL 1: 0.00 0.00 0.00 0.00 0.00 30.00 . T
COL 2: 0.00 0.00 0.00 0.00 0.00 28.00 . G
COL 3: 0.00 0.00 0.00 0.00 0.00 28.00 . R
COL 4: 0.00 0.00 0.00 0.00 0.00 27.00 . P
COL 5: 1.00 0.00 2.00 9.00 0.00 23.00 . E
END 1
START 2
ALPHABET: c M m - <SPACE> <LABEL> <QUERY>
COL 1: 0.00266 0.08577 0.05604 0.00000 0.00000 . X
COL 2: 0.00271 0.11234 0.13852 0.00000 0.00000 . X
COL 3: 0.00240 0.20539 0.20777 0.00000 0.00000 . X
COL 4: 0.00304 0.26166 0.19852 0.00000 0.00000 . X
COL 5: 0.00419 0.26735 0.23101 0.00000 0.00000 . X
END 2
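A profile file of this shape could be read roughly as follows. This is a sketch under assumptions, not the modhmm reader: `parse_prf` is a hypothetical name, each header, ALPHABET, COL, START and END entry is assumed to sit on its own line, and each COL line is assumed to hold one value per alphabet letter, one for the gap symbol and one for <SPACE>, followed by the label and the query letter.

```python
def parse_prf(text):
    """Parse START/END blocks into {number: {'alphabet': [...], 'columns': [...]}}."""
    blocks, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("START"):
            current = {"alphabet": [], "columns": []}
            blocks[int(line.split()[1])] = current
        elif line.startswith("END"):
            current = None
        elif current is None:
            continue  # file header lines outside any block
        elif line.startswith("ALPHABET:"):
            # Keep letters and the gap symbol; skip <SPACE>, <LABEL>, <QUERY>.
            tokens = line.split()[1:]
            current["alphabet"] = [t for t in tokens if not t.startswith("<")]
        elif line.startswith("COL"):
            fields = line.split(":", 1)[1].split()
            n = len(current["alphabet"]) + 1  # letters, gap, plus <SPACE>
            current["columns"].append({
                "values": [float(v) for v in fields[:n]],
                "label": fields[n],
                "query": fields[n + 1],
            })
    return blocks


sample = """Sequence: 1c3wA
NR of aligned sequences: 4
Length of query sequence: 5
START 1
ALPHABET: A C D E - <SPACE> <LABEL> <QUERY>
COL 1: 0.00 0.00 0.00 0.00 0.00 30.00 . T
COL 2: 0.00 0.00 0.00 0.00 0.00 28.00 . G
COL 3: 0.00 0.00 0.00 0.00 0.00 28.00 . R
COL 4: 0.00 0.00 0.00 0.00 0.00 27.00 . P
COL 5: 1.00 0.00 2.00 9.00 0.00 23.00 . E
END 1
START 2
ALPHABET: c M m - <SPACE> <LABEL> <QUERY>
COL 1: 0.00266 0.08577 0.05604 0.00000 0.00000 . X
COL 2: 0.00271 0.11234 0.13852 0.00000 0.00000 . X
COL 3: 0.00240 0.20539 0.20777 0.00000 0.00000 . X
COL 4: 0.00304 0.26166 0.19852 0.00000 0.00000 . X
COL 5: 0.00419 0.26735 0.23101 0.00000 0.00000 . X
END 2
"""
profile = parse_prf(sample)
print(profile[1]["columns"][4]["query"])  # E
```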
The Modhmm replacement letter format is fairly strict. All lines beginning with '#' (as the very first character), and all empty lines are disregarded. Apart from this, the first line must be the number of alphabets, the second line must be a number giving the size of the first alphabet. The third line must be the characters of the alphabet, separated by and ending with ';'. The fourth line is the number of the replacement letters. Next, depending on the number in the fourth line, there are a number of lines with the specification of the replacement letters for that alphabet. Then the next line contains the number giving the size of the second alphabet if it exists, etc.
The specification lines have the following layout: <alphabet letter> = <substitution> <substitution> <substitution> ... . The substitutions have the form: <replacement letter>:<probabilityshare>.
Example 10. file.modhmm-replacement-letter, an example file in Modhmm replacement letter format
The example file file.modhmm-replacement-letter contains
# Nr of alphabets
2
# Alphabet 1 size
20
# Alphabet 1
A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;
# Nr replacement letters 1
3
# Replacement letters 1
U = M:1.0
B = D:0.5 N:0.5
Z = Q:0.5 E:0.5
# Alphabet 2 size
3
# Alphabet 2
s;h;e;
# Nr replacement letters 2
0
# Replacement letters
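The line order required by the replacement letter format lends itself to a simple sequential reader. The following sketch is an illustration rather than the modhmm implementation: `parse_replacement_file` is a hypothetical name, and the sample text assumes that every entry of the example file occupies its own line.

```python
def parse_replacement_file(text):
    """Return one (alphabet letters, replacement table) pair per alphabet."""
    # Lines with '#' as the very first character and empty lines are ignored.
    lines = [line.strip() for line in text.splitlines()
             if line.strip() and not line.startswith("#")]
    it = iter(lines)
    result = []
    for _ in range(int(next(it))):            # number of alphabets
        size = int(next(it))                  # size of this alphabet
        letters = next(it).rstrip(";").split(";")
        if len(letters) != size:
            raise ValueError("alphabet size does not match the alphabet line")
        table = {}
        for _ in range(int(next(it))):        # number of replacement letters
            name, spec = next(it).split("=", 1)
            table[name.strip()] = {
                letter: float(share)
                for letter, share in (item.split(":") for item in spec.split())
            }
        result.append((letters, table))
    return result


sample = """# Nr of alphabets
2
# Alphabet 1 size
20
# Alphabet 1
A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;
# Nr replacement letters 1
3
# Replacement letters 1
U = M:1.0
B = D:0.5 N:0.5
Z = Q:0.5 E:0.5
# Alphabet 2 size
3
# Alphabet 2
s;h;e;
# Nr replacement letters 2
0
# Replacement letters
"""
parsed = parse_replacement_file(sample)
print(parsed[0][1]["B"])  # {'D': 0.5, 'N': 0.5}
```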
The Modhmm subst mtx format specifies a substitution matrix as defined in Section 4.1, “Emission scoring methods”. All lines beginning with '#' (as the very first character), and all empty lines are disregarded. Apart from this, the first line must be a number giving the size of the alphabet of the associated hmm. The second line must be the characters of the alphabet, separated by and ending with ';'. The following lines specify the matrix itself. Each line has the following layout: <alphabet letter> = <score alphabet letter 1> <score alphabet letter 2> ... . There is one line for each letter of the alphabet, not in any particular order, and one line at the end that describes how to interpret letters in the sequences not in the alphabet. This line is only used under certain circumstances in the SMDP and SMDPP scoring methods. There is one column for each letter of the alphabet and the columns must be in the same order as the alphabet specification of the hmm.
Example 11. file.modhmm-subst-mtx, an example file in Modhmm subst mtx format
The example file file.modhmm-subst-mtx contains
#Nr of alphabets
2
# Alphabet size
20
# Alphabet
A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;
# Matrix letters
A = A:1.0 C:0.0 D:0.0 E:0.0 F:0.0 ...
C = A:0.0 C:1.0 D:0.0 E:0.0 F:0.0 ...
D = A:0.0 C:0.0 D:1.0 E:1.0 F:0.0 ...
E = A:0.0 C:0.0 D:1.0 E:1.0 F:0.0 ...
F = A:0.0 C:0.0 D:0.0 E:0.0 F:1.0 ...
G = A:0.0 C:0.0 D:0.0 E:0.0 F:0.0 ...
H = A:0.0 C:0.0 D:0.0 E:0.0 F:0.0 ...
. . .
X = A:0.05 C:0.05 D:0.05 E:0.05 ...
#Alphabet 2 size
3
#Alphabet 2
s;h;e;
# Matrix letters 2
s = s:1.0 h:0.0 e:0.0
h = s:0.0 h:1.0 e:0.0
e = s:0.0 h:0.0 e:1.0
x = s:0.33 h:0.33 e:0.34
The Modhmm frequency format specifies the background frequencies of the alphabet letters in some chosen population as defined in Section 4.1, “Emission scoring methods”. All lines beginning with '#' (as the very first character), and all empty lines are disregarded. Apart from this, the first line must be a number giving the size of the alphabet of the associated hmm. The following lines specify the frequencies. There is one line for each letter of the alphabet, in the same order as the alphabet is listed in the hmm specification.
Example 12. file.modhmm-frequency, an example file in Modhmm frequency format
The example file file.modhmm-frequency contains
#alphabet size
5
# frequencies
0.51
0.14
0.03
0.08
0.24
Note that there are no multiple alphabet frequency files. These are specified separately for each alphabet.
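A frequency file can be read in a few lines. The sketch below is illustrative: `read_frequencies` is a hypothetical helper, and the sample assumes one value per line, as the format description states.

```python
def read_frequencies(text):
    """Return the background frequencies listed after the alphabet size."""
    # Lines with '#' as the very first character and empty lines are ignored.
    lines = [line.strip() for line in text.splitlines()
             if line.strip() and not line.startswith("#")]
    size = int(lines[0])
    frequencies = [float(value) for value in lines[1:]]
    if len(frequencies) != size:
        raise ValueError("expected %d frequencies, found %d"
                         % (size, len(frequencies)))
    return frequencies


sample = """#alphabet size
5
# frequencies
0.51
0.14
0.03
0.08
0.24
"""
print(read_frequencies(sample))  # [0.51, 0.14, 0.03, 0.08, 0.24]
```

Because the file carries no letters, the caller has to pair the values with the alphabet order of the hmm specification.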
All lines beginning with '#' (as the very first character), and all empty lines are disregarded. Apart from this, the first line should contain the number of mixture components. Then, for each component there is a line containing the component's probability value and a line containing the component values for each alphabet letter. Note that the order of the component values must match the alphabet order in the hmm specification, i.e. the first component value of a line will be associated with the first alphabet letter in the hmm specification, etc.
Example 13. file.modhmm-prior, an example file in Modhmm prior format
The example file file.modhmm-prior contains
9
# 9 components
# Component 1
0.178091
0.270671 0.039848 0.017576 ...
# Component 2
0.056591
0.021465 0.0103 0.011741 ...
. . .
Note that there are no multiple alphabet priorfiles. These are specified separately for each alphabet.
All lines beginning with '#' (as the very first character), and all empty lines are disregarded. Apart from this, the first line should contain the alphabet size (n). Lines 2 to n+1 should contain the emission probabilities for the letters of the alphabet, ordered the same way as in the hmm specification. The last line should contain the loop transition probability. The null model is built as an hmm with one singleloop module.
Example 14. file.modhmm-null-model, an example file in Modhmm null model format
The example file file.modhmm-null-model contains
2 # Nr of alphabets
20 # alphabet size
0.075520 # A
0.016973 # C
0.053029 # D
0.063204 # E
0.040762 # F
. . .
0.012513 # W
0.031985 # Y
3 # alphabet size 2
0.33 # s
0.33 # h
0.34 # e
0.997151 # trans
The files called hmmnamefile and seqnamefile in the options are in the Modhmm name file format. These are simply files listing the names of hmms/sequences, including a relative or full path. Note that this file format is very sensitive to blank lines, blanks added at the end of lines, etc. Each line should contain only the name of the hmm/sequence file.
Example 15. file.modhmm-name-file, an example file in Modhmm name file format
The example file file.modhmm-name-file contains
DATASET/FASTA_FILES/A15_HUMAN_MOLLER_C.fa
DATASET/FASTA_FILES/A26161_HALSS_MPTOPO3D.fa
DATASET/FASTA_FILES/A41616_HUMAN_MPTOPO3D.fa
DATASET/FASTA_FILES/A42226_ECOLI_MPTOPO1D.fa
DATASET/FASTA_FILES/ACC8_CRICR_TMPDB1D.fa
DATASET/FASTA_FILES/ADT2_YEAST_MOLLER_C.fa
DATASET/FASTA_FILES/ADTA_RICPR_MOLLER_B.fa
. . .
Files in the Modhmm alphabet format are used to specify the alphabet and the emission probabilities when adding an alphabet to an already existing HMM. The first row contains the text 'ALPHABET: ' and a string containing the alphabet, where the letters are separated by ';'. The second row contains the text 'ALPHABET LENGTH: ' and a positive integer giving the number of letters in the alphabet. The remaining rows contain the text 'VERTEX X: ' and a sequence of floating point numbers (separated by a 'SPACE' character). The sequence of floating point numbers gives the emission probabilities for the alphabet letters in state X of the model. This means that each line should contain as many floats as there are letters in the alphabet. The order of the probabilities should follow the order of the letters in the alphabet specification in the first row. There should be one row for each state in the HMM to which the alphabet is to be added. For discrete alphabets, the numbers in a row should sum to either 1.0 (for emitter states) or 0.0 (for silent states).
Example 16. file.modhmm-alphabet, an example file in Modhmm alphabet format
The example file file.modhmm-alphabet contains
ALPHABET: A;B;C;
ALPHABET LENGTH: 3
VERTEX 0: 0.0 0.0 0.0
VERTEX 1: 0.1 0.1 0.80
VERTEX 2: 0.23 0.27 0.50
VERTEX 3: 0.0 0.0 0.0
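The constraints on an alphabet file (matching lengths, rows summing to 1.0 or 0.0) are easy to check before handing the file to modhmm. The following sketch is an assumption-laden illustration: `parse_alphabet_file` is a hypothetical helper, and the tolerance of 1e-6 on the row sums is an arbitrary choice, not a value taken from modhmm.

```python
import math


def parse_alphabet_file(text):
    """Return (letters, {vertex number: emission probabilities})."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines[0].startswith("ALPHABET: "):
        raise ValueError("first row must start with 'ALPHABET: '")
    letters = lines[0][len("ALPHABET: "):].strip().rstrip(";").split(";")
    if not lines[1].startswith("ALPHABET LENGTH: "):
        raise ValueError("second row must start with 'ALPHABET LENGTH: '")
    length = int(lines[1][len("ALPHABET LENGTH: "):])
    if length != len(letters):
        raise ValueError("ALPHABET LENGTH does not match the alphabet")
    rows = {}
    for line in lines[2:]:
        head, values = line.split(":", 1)     # e.g. 'VERTEX 1', ' 0.1 0.1 0.80'
        vertex = int(head.split()[1])
        probs = [float(value) for value in values.split()]
        if len(probs) != length:
            raise ValueError("vertex %d: wrong number of values" % vertex)
        total = sum(probs)
        # Emitter states sum to 1.0, silent states to 0.0 (tolerance assumed).
        if not (math.isclose(total, 1.0, abs_tol=1e-6)
                or math.isclose(total, 0.0, abs_tol=1e-6)):
            raise ValueError("vertex %d: row sums to %g" % (vertex, total))
        rows[vertex] = probs
    return letters, rows


sample = """ALPHABET: A;B;C;
ALPHABET LENGTH: 3
VERTEX 0: 0.0 0.0 0.0
VERTEX 1: 0.1 0.1 0.80
VERTEX 2: 0.23 0.27 0.50
VERTEX 3: 0.0 0.0 0.0
"""
letters, rows = parse_alphabet_file(sample)
print(letters)  # ['A', 'B', 'C']
```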
There are 7 different ways to calculate the score (probability) for a multiple sequence alignment column to be emitted
in a certain state. Let us call this score
The first scoring method, DP, is the dot product
Here,
The second implementation of
As for the scoring method above, this formula only increases the time complexity of the forward and backward algorithms by a factor equal to the size of the alphabet.
The third scoring method, PICASSO, was originally developed for non-HMM-based profile-profile comparison. The probabilistic interpretation of this score is that it (like GM) calculates the geometric mean, but an emission probability is not a pure frequency; it is a frequency weighted according to how common the letter is in a given background distribution. In the symmetric version, the profile columns are weighted in a similar fashion. The respective columns are then normalized to sum to 1. In PICASSO, the emission score is calculated as
for the original version
[1]
or as
Equation 0.
for the symmetric version
[4].
Here,
The fourth scoring method, DPPI, is a combination of DP (Section 4.1.1, “DP”) and PICASSO (Section 4.1.3, “PICASSO”). Here, a sum of products between the elements of the two vectors is calculated as in standard DP, but each term in the sum is also divided by the background frequency of that letter.
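The verbal descriptions of DP and DPPI can be written out in a few lines. The sketch below is only one reading of the text, not the modhmm code: the function and argument names are hypothetical, the two vectors are taken to be the state's emission probabilities and the letter shares of the alignment column, and no normalization step is included.

```python
def dp_score(emission, column):
    """Dot product of a state's emission probabilities and the column shares."""
    return sum(e * x for e, x in zip(emission, column))


def dppi_score(emission, column, background):
    """Like DP, but each term is divided by the letter's background frequency."""
    return sum(e * x / f for e, x, f in zip(emission, column, background))


# Toy three-letter alphabet; all numbers are made up for illustration.
emission = [0.5, 0.25, 0.25]      # emission probabilities of one state
column = [0.5, 0.5, 0.0]          # letter shares in one alignment column
background = [0.5, 0.25, 0.25]    # background frequencies of the letters

print(dp_score(emission, column))                 # 0.375
print(dppi_score(emission, column, background))   # 1.0
```

Both functions loop once over the alphabet, in line with the statement that these methods add only a factor of the alphabet size to the forward and backward algorithms.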
The fifth implementation of
Equation 0.
where
The sixth implementation of
Equation 0.
where
The seventh implementation, SMDPP, is essentially the same as the SMDP method ( Section 4.1.6, “SMDP” ), but for all letters in a certain position for which the substitution matrix value is below a certain threshold, the sequence column shares are evened out, i.e. redistributed uniformly over these letters. The objective is to disregard the significance of a random, "unlikely" mutation.
[2] R. Hughey and A. Krogh. “Hidden Markov models for sequence analysis: extension and analysis of the basic method”. Comput. Appl. Biosci., 12(2):95-107, 1996.
[3] A. Krogh. “Hidden Markov models for labeled sequences”. Proceedings of the 12th IAPR International Conference on Pattern Recognition, Los Alamitos, California, pages 140-144, 1994. IEEE Computer Society Press.